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£N| , Abstract 

We use p-values as a discrepancy criterion for identifying the threshold value at which 
a regression function takes off from its baseline value - a problem that is motivated by 
applications in omics experiments, systems engineering, pharmacological dose-response 
studies and astronomy. In this paper, we study the problem in a controlled sampling 
setting, where multiple responses, discrete or continuous, can be obtained at a number 
of different covariate-levels. Our procedure involves testing the hypothesis that the re- 
gression function is at its baseline at each covariate value using the sampled responses at 
' that value and then computing the p-value of the test. An estimate of the threshold is 

provided by fitting a stump, i.e., a piecewise constant function with a single jump discon- 
tinuity, to the observed p-values, since the corresponding p-values behave in markedly 
different ways on different sides of the threshold. The estimate is shown to be consis- 
tent, as both the number of covariate values and the number of responses sampled at 
each value become large, and its finite sample properties are studied through an exten- 
sive simulation study. Our approach is computationally simple and can also be used to 
estimate the baseline value of the regression function. The procedure is illustrated on 
■ two motivating real data applications. Extensions to multiple thresholds are also briefly 

\& | investigated. 
H ! 

(*C) . Keywords: baseline value, change point, consistent estimate, controlled sampling, misspecificd 

model, stump function. 
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§ ; 1 Introduction and Problem Formulation 



In a number of applications, the data follow a regression model where the regression function 
fi is constant at its baseline value t up to a certain covariate threshold d° and stays above 
To at higher covariate levels. For example, consider the data in the left panel of Figure Q] 
that depict the average delay of customers as a function of the loading of a complex queueing 
system (more details about the system are given in Section It can be seen that for small 
loadings the delay is rather small, while a clear positive trend emerges for larger loadings. The 
system's operator is interested in identifying this threshold with high precision, as well as the 
level of the baseline To, since such knowledge specifies an operational regime of low average 
delay, whose value can be announced to potential customers. An example from a different 
scientific domain is shown in the right panel of Figure [TJ It depicts the expression levels of a 
gene over time obtained from multiple cell-lines, where again the function stays at its baseline 
level up to some time, then rises sharply and subsequently flattens out (more details about 
the underlying experiment are given in Section |5.3|) . This problem requires procedures that 
can handle multiple change-points of the regression function, namely where it deviates from 
the baseline value and also where it starts flattening out towards the end of the range of the 
covariate. These thresholds are of interest as they represent important stages of progression 
of the cell from normalcy to malignancy. 

Problems with identical structure also arise in pharmacological dose-response studies, where 
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Figure 1: Left panel: Data of delay versus loading from a complex queueing system. Right 
panel: Gene expression levels over time. 

fi(x) provides information about reaction to dose-level x and is typically at the baseline value 
up to a certain dose, referred to in the literature as the minimum effective dose (MED); see 
Chen and Chang (2007) and Tamhane and Logan (2002) and the references therein. Similar 
problems arise in toxicological applications; see, for example, Cox (1987), who uses para- 
metrically specified threshold models. Yet another field of application is astronomy; one of 
particular interest to the authors deals with estimating the "tidal" radius of a dwarf spheroidal 
galaxy (see Sen et al. (2009)). The mean velocity of the stars along the major axis of the 
dwarf spheroidal galaxy, as a function of the distance from the center, is assumed to be con- 
stant to the left of the "tidal" radius (threshold parameter), and takes off from this baseline 
value, due to interactions with the gravitational field of the Milky Way, as we move to the 
right of the threshold. Further applications and extensions are discussed in Section [6j We 
also note that our current problem of interest is a special and important case of the more 
general question of identifying the region where a function (defined on a general Euclidean 
space) assumes an extremal (minimum/maximum) value. 

Formally, we consider a non-negative regression function /x(x) on [0, 1] with the property 
that fj,(x) = To for x < d° and n(x) > To for x > d° for some d° <G (0, 1). As already men- 
tioned, quantities of prime interest are d° and To that need to be estimated from noisy data 
{Yi, Xi}f =1 , with X{S assuming values in [0,1] and Yi — jttpQ) + e,, where ej is a mean 
error. We call d° the "to threshold" of the function /i. In this generality, i.e., without any 
assumptions on the behavior of the function in a neighborhood of d° , the estimation of the 
threshold dP is a hard problem and has not been extensively addressed in the literature. In 
the simplest possible setting of the problem posited, the regression function fi(x) has a jump 
discontinuity at d°. In this case, d° corresponds to a change-point for /i and the problem 
reduces to estimating this change-point in a regression model. Such change-point models are 
very well studied; see, for example, Mueller (1992), Loader (1996), Mueller and Song (1997), 
Koul and Qian (2002), Lan, Banerjee and Michailidis (2009) and references therein. 

The problem becomes significantly harder when /i is continuous at d°; in particular, the 
smoother the regression function in a neighborhood of d , the more challenging the estima- 
tion. For example, if d is a cusp of \x of some known order p (i.e., the first p — 1 right 
derivatives of \i at d° equal but the p-th. does not, so that d° is a change-point in the p-th 
derivative), one can obtain nonparametric estimates for d° using either kernel based (Mueller 
(1992)) or wavelet based (Raimondo (1998)) methods. The convergence rate of such estimates 
decreases dramatically with p (see Raimondo (1998)), even with parametric specifications of 
jU to the right of the unknown d° (Feder (1975)). In fact, if [i is infinitely diffcrcntiable at d°, 
no estimate possibly converges at a polynomial rate, although we are not aware of any such 



concrete result in the literature. However, in most applications, the degree of differentiability 
of \i at d° will not be known, which makes necessary the development of adaptive estimation 
procedures that do not require prior information about the smoothness of the underlying 
regression function. 

In many applications of interest, /i is known to be continuously increasing, and a natural 
candidate for an adaptive procedure is isotonic regression; further, it automatically avoids 
the specification of bandwidths and equivalent smoothing parameters (see Robertson et. al. 
(1988) and Silvapulle and Sen (2005)). Applications of isotonic regression in calibration type 
problems involving thresholds are discussed, for example, in Osborne (1991), Gruet (1996) 
and Tang et al. (2010). It is known that the isotonic regression estimate fi is piecewise 
constant, usually possessing a flat stretch of for small values of the covariate. Thus, one 
can prescribe d = \nf{x : jl(x) > 0} as an estimate of d°, but this, in most cases, severely 
underestimates the true threshold. It is possible to employ penalized isotonic estimates or 
replace the in the definition of d by a positive sequence r] n converging to (at an appropriate 
rate) , but hardly anything is known in the literature about the theoretical properties of such 
procedures. Moreover, in many applications the assumption of global monotonicity of [i may 
not hold. 

This paper develops a novel approach for the consistent estimation of d° for situations where 
multiple observations can be sampled at a given covariate value that (i) does not require 
knowledge of the smoothness of \i at d°, (ii) does not require computing an explicit estimate 
of \i and (iii) is computationally simpler than most nonparametric procedures. Note that this 
multiple observations per "dose" setting is the scenario for both our data applications and 
also for most of the dose-response studies in pharmacological experiments. 

Note that, in both the motivating applications introduced in the first paragraph, the ex- 
perimenter can specify the values of the covariate (either the load of the queueing system 
or the time point when the cell-line is harvested) and subsequently obtain the corresponding 
sample responses (delays or expression levels) . In general, these sampled responses are expen- 
sive to obtain. In the first example, this is the case since generating the responses involves 
a discrete event simulation, with the response at each loading obtained by averaging over a 
number of events (e.g., customers who have received complete service from the system). For 
large scale systems, this may require tens of thousands of such events, which may exceed the 
allotted budget of resources. An alternative strategy is to rely on responses obtained from a 
fairly "small" set of events for each selected loading, which would lead to significantly more 
noisy observations, as seen in the left panel Figure [TJ In the second example, carrying out 
the biological experiment is fairly costly primarily due to the labor involved (preparation of 
cell-lines, microarray hybridization and processing). The obtained data can be noisy due to 
the inherent biological variability of the cell-lines. Nevertheless, the goal is to identify the 
transition point (s) and the corresponding levels of the threshold from such noisy data. The 
developed nonparametric methodology allows us to resolve this issue in a satisfactory man- 
ner. Specifically, it relies on testing for the value of /i at design levels of the covariate. The 
obtained test statistics are then used to construct p-values which, under mild assumptions 
on fx, behave in markedly different fashions on either side of the threshold d° and it is this 
discrepancy that is used to construct an estimate of d° . 

The remainder of the paper is organized as follows: in Section [3J the proposed procedure 
is introduced in the case of known tq, its properties derived and some extensions discussed. 
The case of unknown tq is examined in Section [3l Section [4] briefly investigates the general- 
ization to multiple change points. The performance of the procedure based on simulated data 
is studied in Section [5] where comparisons with other competing methods are also presented. 
The procedure is also illustrated on real data from the two motivating applications. Some 
concluding remarks are drawn in Section [51 while proofs of most technical results are given 
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in the Appendix. 



2 The p— value procedure: the known To case 

To introduce and motivate the proposed p-value procedure, we first consider the case with To 
known. Specifically, let Y = fi(X) + e, where /i is a function on [0, 1] and 



for dP £ (0,1). Note that no other assumptions are made on the behavior of /i around d° . 
The covariate X is sampled from a Lebesgue density px on [0, 1] and e is independent of X 
and distributed as N(0, c 2 ), with a known. Relaxations of some of these assumptions will be 
discussed later. 

As argued above, estimation of dP via direct estimation of fj, is a hard problem. The use 
of p-values allows a relatively easy solution when multiple responses can be sampled at each 
covariate value. More specifically, we have: 



with N = m x n being the total budget of samples. The e^'s are i.i.d. and distributed like e 
above and the Xi's are i.i.d. from px- The assumption of equal number of replicates (m) at 
each Xi can actually be relaxed to a certain extent and is discussed briefly in the concluding 
discussion, but for the sake of ease of exposition of the key ideas we assume this throughout 
the paper. 

At dose Xi = x, we test the null hypothesis Hq^ x : fJ,(x) = To against the alternative 
H\,x ■ fJ-(x) > r using the test statistic T(x) = ^™( Y ^~ T °'> where F^. = ^J^jLi^j ■ 
The observed p-value for this test is p^ m ^{x) = 1 — $>(T(x)), since T(x) is distributed as 
N(0, 1) under the null hypothesis. From the n different dose levels, we obtain n p-values 
p^ m \X 1 ),p( m \X 2 ),...,p( m \X n ). 

Under the null hypothesis which holds to the left of d°, the p-values have a Uniform(0,l) 
distribution. To the right of d , where the null hypothesis fails, the distributions of the p- 
values change and as m becomes large, the p- values converge to the degenerate value 0. This 
dichotomous behavior of the p-values on either side of d° can be used to prescribe consistent 
estimates of the latter. A natural way to capture this discrepancy, which we explore in this 
paper, is to consider the expected p-value curve at stage m, formally v m {x) = E(p^ m '{x)). 
Notice that this is identically 0.5 for all x < d°, irrespective of m, while for x > d°, it con- 
verges to as m increases. We illustrate this in Figure [5] assuming a — 0.5 for m = 10, 20, 
50 and 100. This simple observation can be used to construct estimates of d° which do not 
involve estimating fi. We can fit a stump to the observed p-values, with levels 1/2 and 
on either side of the break-point and prescribe the break-point of the best fitting stump (in 
the sense of least squares) as an estimate of d°. The virtue of this approach lies in the fact 
that we are able to estimate the threshold consistently, as established rigorously below, by 
merely fitting a simple mis-specified working model. Its success relies on the fact that the 
p-values eventually show stump like (dichotomous) behavior which no estimate of fi could 
have inherited, regardless of sample size. We describe our approach quantitatively below. 

For convenience, denote Zi m = p lym \Xi). Letting £,d{x) — h\{x < d), to find the stump 
with levels 1/2 and that best approximates the observed p-values Zi m , i — 1,2, ... ,n, we 
minimize 



fj,(x) = tq for x < d°, and /j,(x) > tq for x > 



(1) 



Ytj = (j.{Xi) + eij, i = l,2,.,.,n; j = l,2,...,m, 



(2) 




(3) 
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Figure 2: Expected p-value curves converging to a stump with increasing to. 

over rfe [0,1]. Let d m ,n — argmin^gjo,!] M mj71 (d), which is a natural estimator of dP. 

However, in practice, the use of 1/2 and as the stump levels may not always be the best 
strategy Firstly, the p-values to the right of d° may not be small enough to be well approx- 
imated by for a finite to; secondly, we may often have situations where the i?i m 's are not 
exact but approximate p-values. For example, if a was unknown in the above setting, T(x) 
would take the form ^JmYi./a, where a is some estimate of tr, in which case the ZimS would 
not be uniformly distributed (or even close to a uniform for modest to). In such cases, one 
can adopt a more adaptive approach by allowing the stump levels to converge to 1/2 and 
with increasing to, or by keeping the stump-levels unspecified and estimating them from the 
data itself. Such adaptive procedures might provide a better fit to the observed p-values and 
improve the precision of the estimate of dr. 

We summarize next the proposed estimation procedure and establish its consistency. Formally, 
the setup is as follows: Consider the (possibly heteroscedastic) regression model Y = [i(X) + e 
with /i as in ([1]), E(e\X) — 0, Var(e|X) > and the covariate X following a Lebesgue density 
px on [0, 1]. The available data from this model {Xi, {Yij}JLi}?=i are exac tly as hi ©■ The 
steps of the procedure are the following: 

1. For i — 1,2, ... ,n, let p^ m \Xi) = Z im denote the observed (potentially approximate) 
p-value based on a test of the hypothesis H ^ : jit(Xj) = against the alternative 
Hi ti : n{Xi) > 0, using data {ly : j = 1,2,..., to}, such that Z\ m , Z 2m , . . . , Z nm are 
i.i.d. as well. 

2. Fit a stump a m l{x < d) + /3 m l(x > d) to {Zi m }" =1 , where a m and (3 m are known 
non-negative quantities that converge to 1/2 and respectively. For d € [0, 1] define: 

M TO)Tl (d)= (Z lm ~a m ) 2 + {Z lm -fi m f (4) 

i:Xi<d r.X ■■! 

and let d m , n = argmindg[o,i] M mjra (ci). We can choose a m = 1/2 and j3 m = for all to 
to get back to the setting of ([3]). 

3. We can even let the data choose the optimal a m and /3 m by setting 

n 

m ,n = (a m ,n,Pm,n,d m ,n) = arg min y^{Z im -al(Xi<d)-/31(Xi>d)} 2 . 

6>=(a,/3,d)e[0,l] 3 f— f 

Theorem 2.1 Consider the above setup of the problem and let v m {x) = E(Z\ m \X\ = x). 
Assume further that (a) v rn (x) — > v(x) := (1/2) l(a; < d ) for each x, as to — > oo, and (b) 
Px{x) > k > for x G [d — I, d + 1} for some (small) I > 0. We then have: 
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(i) d m . n A d° as m,n — » oo, i.e., given e, 77 > 0, £/iere exists a positive integer K, such 
that for m,n > if, -P{|dm,n — > e} < 77. 

(ii) 8 m , n A- 9q = (1/2,0, d°) as m,n—$- 00. 

The theorem is proved in the Appendix. 

Flexible modeling of the p— value curve: In many applications the curve /z is an in- 
creasing continuous function, whence the expected p-value curves are continuous decreasing 
functions converging to a stump with increasing m as in Figure ©. One can then also use 
continuous parametric working models to take into account the shape of the p-value curve for 
finite m. Figure ^ suggests looking at sigmoidal curves. We propose and explore one such 
family, a 2-parameter sigmoidal family Q next: 

Q = l^ d , a (x) = ±l{x <d}+ 1 C + ^_l(Ld) ^ > d} : d € [0, 1], a > o| . (5) 

For x < d, tpd,a(x) = 1/2 and this models the region where /i = (or a constant). For x > d, 
ipd,a is decreasing, thus mimicking the finite sample behavior of the expected p-value curve 
v m . It can be shown that consistent estimates of d° may be obtained with this misspecified 
class of models, as well. Consistent estimation of d° is possible because in the limit (as a — > 00 
and d — > d ) the sigmoid converges to the stump (indicator) function v. 

We estimate the parameters d and a of the working model by solving a least squares problem, 
i.e., 

1 ™ 

(d m ,n,oc m ,n) = arg min G m , n (d, a) = - V"{Z lm - ip d a (Xi)} 2 . (6) 
de 0,1 ,a>o n 

i—l 

It can be shown under assumptions (a) and (b) of Theorem 12.11 that d m ,n A d°, as n, m— >oo, 
using arguments similar to those in the proof of that theorem. 

It should be noted that there is nothing special about using the parametric model ([5]) to 
estimate d°. Any reasonable class of models that includes (or can converge to) the stump 
function would, in principle, yield consistent estimators of d a . However, we focus on the 
stump mostly because of the conceptual/computational simplicity and accurate finite 
sample performance (as illustrated in the simulation study in Section [SJ in which results for 
the 2-parameter sigmoidal family are also included) . Figure [3] shows the true expected p- 
value curve (shown by the solid curve) and illustrates the three methods of finding d° in a 
single simulation run with a = 0.5, d° = 0.5, n = 20, m = 10: (1) fitting a stump with levels 
1/2 and (shown by the solid vertical line denoting the estimated value of d°); (2) fitting a 
stump with adaptive levels as in Theorem l2.ll (m) (shown by the dashed horizontal lines) and 
the estimated d° (shown by the dotted vertical line); (3) the fitted sigmoid model defined in 
© (shown by the dashed-dotted curve). 

2.1 The procedure under relaxed assumptions 

While the assumptions of normality of errors and known variance were used to motivate the 
procedure of stump based approximations to the p-values in the previous section, Theorem 
12.11 shows that these assumptions can be relaxed considerably. Further, as the theorem does 
not require the responses to be continuous, the procedure is also valid in discrete response 
settings, one of which we illustrate in the subsequent discussion. We discuss below a number 
of different scenarios that can arise in practice. 

(i) Homoscedastic errors with unknown variance: Suppose that the e^-'s are continuous i.i.d. 
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Figure 3: Fitting different models to the p-value curve. 

random variables with mean and unknown variance a 2 and independent of {Xi}" =1 . First, 
define "working" p-values p^(X 1 ),p( m \X 2 ), . . . ,p (m) (A„) as p {m) {Xi) = 1-$ ( ^p^" To) ) 
where a 2 (Xi) = J^JLiO^ij — Yi-) 2 /(m — 1). The p-values are then independent, v m (x) = 
E{p( m \Xi)\Xi — x) converges to u(x) and Theorem 2.1 applies, yielding the consistency of 
the stump-based least squares estimates of d°. However, owing to the homoscedasticity of the 
errors, the constructed p-values are clearly sub-optimal since we could have replaced <x(-Xj) in 
the definition of the i'th p-value by (T m ,m where <7„ n = Y17=i Sj=i 0*ij ~ Yi..-) 2 / (mn — n) 
is the standard pooled estimate of a 1 (and is significantly superior to each crpQ)). Thus, 
least squares estimates of d° based on (6) and (c) in the build-up to Theorem l2.11 using such 
improved p-values would certainly yield consistent estimates of dr. Unfortunately, we are 
no longer quite in the setting of Theorem 12.11 as the presence of a in each p-value makes 
them dependent, and therefore cannot invoke its conclusions to deduce consistency. We tackle 
the consistency issue for this problem below. Roughly speaking, as m, n grow, a stabilizes 
quickly, the p-values become approximately independent and the scenario approaches that of 
Theorem O 

Theorem 2.2 Assume that (a) inf ( jo +/ 3< d < 1 (i(d) > tq for any j3 > 0, and that (6) px{%) < K 
for x G [d°,d° + if) for some rj > 0. Let d m>n be the least squares estimate of d ob- 
tained as: d m ,n = argmin d J2i-.x,<d { z ir,i-\) 2 + J2i-.x z >d {Z im -0) 2 , where Z im = 
1 — $ ( v 7 "^"'-- — ISl^y these being the 'improved' p-values alluded to in (i) above. Then 

Ui rn.n 

asm,Ti->oo. 

A condensed version of the proof is available in the Appendix. 

Remark: If we know that the errors are normal, the statistics \JmYi.ja follow a t mn - n 
distribution. Letting F tmn _ n denote the corresponding distribution function, we can take 

Zim = 1 — Ft m n-n ( V ^ Y '' ) and the corresponding estimates of d° continue to be consistent. 

(ii) Heteroscedastic errors: Consider a regression set-up with continuous responses and het- 
eroscedastic errors: i.e. a 2 (x) = Var(e|A = x) varies with x. Thus, a 2 (Xi) is estimated by 

n 2 {Xi) = YZi ( Y v - Y *-) 2 /( m - 1) and Z im = 1 - $ ( ^|[^y To) ); in this case, pooling 
is no longer possible unlike case (z). It is not difficult to see that v m {x) = E(Zi m \Xi = x) 
converges to v(x) as above and the conclusions of Theorem 12.11 hold. In case the errors are 
conditionally normally distributed, $ in the definition of the p-values can be replaced by 
Ft m _i, the distribution function of the f m _i distribution. 
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(Hi) Discrete responses: Settings where the response is discrete can also be considered. One 
can think of the covariate values as dose-levels and suppose, at each dose level, JQ, that a 
binomial experiment with m independent subjects is performed, and the response Yi is the 
number that show a reaction to the dose. The function is the probability that a subject 
yields a reaction at dose x and is assumed to be at a baseline value po > for x < d° and 
greater than po otherwise (note that po = gives a pathological situation) . 

We base our p— value at dose level Xj on the normalized statistic (Yi — mpo)/ \Jmp§ (1 — po) 
with (working p— value) Z im = 1 — $((J^ — mpo) / yrnpo (1 — Po))- Alternatively, the Zi m 's 
could also be defined to be p-values based on the exact binomial distribution: i.e., Zi m = 
1 — F myPa (Yi) where F m ^ Po is the distribution function of Binomial(m,po)- The conditions of 
Theorem 12.11 are easy to verify in either case and the conclusions of the theorem continue to 



2.2 A Digression: Composite Hypotheses 

Consider now a situation where (i(x) is known to be strictly less than (a known) Co if x < d° 
and strictly greater than Co for x > d°. We keep the behavior at d° unspecified. For a 
monotone function /i this reduces to estimating its inverse at Co- The latter problem has 
been well-studied in the literature (Banerjee and Wellner (2005), Tang et. al. (2010) and 
references therein) but only under explicit shape and/or smoothness constraints on fi. Our 
formulation, on the other hand, is much broader in scope as it requires neither monotonicity, 
nor smoothness assumptions. 

To formulate the problem, consider the model posited in ([2]) with homoscedastic normal 
errors and /i as in the beginning of the previous paragraph. At dose X; = x the statistic 
T{x) = iJrn{Yi. — Qo)/a is used to test the composite hypothesis, H 0yX : fj,(x) < Co versus 
Hi,x ■ n{x) > Co, with rejection for large values of T(x). Construct an approximate p-value as 
p( m \x) — 1 — <&(T{xj). Define v m (x) = E[p( m \x)]. It is easy to check that v m {x) converges 
to 1 as m — > oo for x < d° and to for x > d°. This suggests 



as a natural estimate of d°, with Z im = p^ (Xi),i = 1, 2, . . . , n. 

We present an analogue of Theorem 12.11 (in the current setting) that generalizes the above 
strategy. We start by summarizing the proposed estimation procedure. 

1. Identical to Step (a) in the build-up to Theorem[2Tj except forp( m )(X;) = Z im denoting 
the observed (potentially approximate) p-value based on a test of the hypothesis H 0yi : 
n(Xi) < Co against the alternative H\ t i : fi(Xi) > Co- 

2. Step (b) remains the same as before but with a m converging to 1. 

3. Step (c) remains unaltered. 

Theorem 2.3 Let v m {x) — E(Zx m \Xx — x) and further assume that (a) u m (x) — > 1 for 
each x < d° and v m [x) — > for each x > d , as m — > oo, and, (b) Px( x ) > ^ > for 
x 6 [d — l,dP + I] for some (small) I > 0. Then, we have 

1- d m<n A d° as m, n— > oo. 

2- 9 m .n 6o = (1, 0, d°) as m, n — > oo. 

The proof of the theorem is similar to that of Theorem 12.11 and is therefore skipped. 



hold. 




i:Xi<d 



i:Xi>d 
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3 The Case of an Unknown tq 



In this section, we address the more realistic situation where To is unknown. As will become 
apparent in subsequent developments, a number of complications arise in this setting. In 
order to focus on the main ideas, we confine ourselves to the setting of continuous responses. 
Throughout this section, we consider the problem of fitting a one-parameter stump with levels 
1/2 and on either side of the jump- location, the main parameter of interest. 

Homoscedastic errors: Consider first, the homoscedastic error setting as in (i) of Sec- 
tion 12.11 The consistency of the least squares estimate of d° in this model by fitting a stump 
(with levels 1/2 and on either side of the jump) to the observed p-values was established 
in Theorem 12.21 However, with an unknown tq, these p-values can not be constructed. A 
natural alternative is to replace To with some appropriately consistent estimate. The next 
theorem shows the consistency of the estimate of d° obtained by fitting a stump-model to the 
observed p- values when an appropriate estimate of t is used. For a, t > 0, define: 

Zf m (r) = l-$(V^(F l .-T)/a) (7) 

for i = 1, 2, . . . , n. 

Theorem 3.1 Consider model (Hp where the errors {tij} are i.i.d with variance o~q and 
independent of {Xj}. Suppose that f = f m n is a consistent estimator of t such that 
y / m(f — To) = o p (l). Let a = a m>n he a consistent estimator of ctq and let d = d m . n be 



estimated as: 



d m ,n = arg min 
«Je [0,1] 



1 1 2 ^ 

(f m ,„) - - \ + { Z im n ( f m,n) - oj 



E 

,:Xi<d v ' i:Xi>d 



(8) 



Then d m „4rf° asm,n->oo. 



Remark: In a situation where d° may be safely assumed to be greater than some known 
positive n, an estimate of To satisfying the condition of the above theorem can be obtained 
by taking the average of the response values on the interval [0,77]. However, this does not 
offer a satisfactory solution, since such an rj may not be known in various applications. Also, 
even if 77 is known but small, the estimate thus obtained will be unsatisfactory unless n is 
really large. Below, we adopt a more principled approach to the estimation of To that does 
not require such background knowledge, once again using p-values. 



We now focus on constructing an explicit estimator f of To as required in Theorem 13. 1[ using 
p-values. Recall the definition of Z? m (T) in (|7J|. Let t > To and note that as m increases, 
for n{Xi) < t, Zi™' n (r) goes to 1 in probability, while for [i(Xi) > t, Z?™' n (r) goes to in 
probability. For any t < To, it is easy to see that Z^' n (r) always goes to in probability, 
whereas when r = t , Z^'"(t) goes to for X{ > dP, but is uniformly distributed on (0, 1) 
for Xi < d° for every m. Thus, it is only when r = To that Z^"'"(t)'s are the closest to 1/2 
for a substantial number of i's. This suggests a natural estimate for To: namely, 

n 

f m ,„ = argmin^{zf-"(T) - 1/2} 2 . (9) 

r * — * 

8=1 

Once T min is obtained, an estimate of d , say d m , n can be obtained by taking f min to be Tm,n in 
© and a m ^ n to be a m>n . This method of estimating d° and To is referred to, subsequently, as 
Method 1. Theorem l3.2l shows that under some mild conditions on the function /1, y/m (f m „ — 
To) is o p (l). This along with the fact that a m ,n is consistent for 00, implies that d m ,n is 
consistent for dP by Theorem 13. II 
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Theorem 3.2 Consider the same setup as in Theorem \3.1\ Further suppose that the regres- 
sion function [i satisfies: 

(A) Given r\ > 7 there exists e > such that, for every r > tq, 

px{x)dx < rj. 



x>d°:\fi(x)-T\<e} 

Also assume that<f> m , the density function of j '<7q, converges pointwise tocj), the standard 
normal density. Then y/m {f m ,n — r o) ^0 asm,n^oo. 

Remark: Condition (A) is guaranteed if, for example, fi is strictly increasing to the right of 
d° although they hold under weaker assumptions on \x. In particular it rules out flat stretches 
to the right of d°. Note that the assumption that 4> m converges to </> is not artificial, since 
convergence of the corresponding distribution functions to the cdf of the standard normal is 
guaranteed by the central limit theorem. 

An alternative method (Method 2): Notice that the previous method involves the esti- 
mation of d° in two steps: first by estimating To and subsequently using this estimate 
to approximate the p-values to which a stump is fitted, as described in ([8]). An alter- 
native one-step method for estimating d° (that avoids estimating ro) is presented, when 
fi is increasing. Define t; n (x) = E(Y\X < x). A natural estimate of £ n ( x ) is given by 
iin(x) = Si-xKxSjIi Yij/{ m — x )} ■ ^ can be easily checked that for x £ (0, 1), 

£,n(x) — £(x) is O p ((mn) _1 / 2 ), a fact that will be used later. As our estimate of d° we propose: 



dm n = arg min 
de[o,i] 



E {zt n (Ux i ))-i/2} 2 + {4r(to))-o} 2 

i:Xi<d i:X .,/ 



(10) 



We do not formally establish the consistency of this procedure in the Appendix, but provide 
a heuristic discussion below. The performance of this method is also assessed via simulation 
studies. Once d m ,n has been obtained, an estimate of To is given by £ n (dm,n)- 

Discussion: Denote the quantity within the big square brackets on the right side of the 
above display by 9(d). Let r\ > be such that < d° — r\ < d° + r\ < 1. Consider the 
difference ^(d — if) — ^(d ), which can be written as: 



E [{CT(lnPQ))} 2 - {zt"(UXi)) - 0.5} 

d°--q<Xi<d° 

Now, for any X, e (d° - T), dP], 

V 0~m,n 0~rn,n J 

For sufficiently large m, n, the first term within the brackets on the right side of the above 
display is approximately distributed like a standard normal, while the second term is small by 
virtue of the fact that y/m (£ n (Xj) — ro) is O p (n -1 / 2 ) (where we tacitly make use of the fact 
that these Xfs are all bounded away from 0). It follows that the right side is approximately 
distributed like a Uniform(0,l) denoted by Ui. Thus, — 77) — 9(d°) behaves approximately 

like J2i:d°-r,<Xi<d U f ~ ^2i:d°- v <Xi<d° - °- 5 ) 2 for that are approximately uniform 
and weakly correlated for sufficiently large m, n. But this quantity will tend to be non-negative 
with high probability. A similar argument can be used to show that 9(d a + rf] — ^>(d°) will 
tend to be non-negative with high probability, when fi is increasing, which we leave to the 
reader. This illustrates why the minimizer of is close to d°, with high probability, in the 
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long run. 



Heteroscedastic Errors: We briefly discuss the case of heteroscedastic errors as in (ii) of 
Subsection O Let o* pr<) = ££=i (K y - F,) 2 /(™ " 1) and Z.f™ (Xl) (r) = 1 - *(yfiX(Yi- ~ 
T)/a m (Xi)). To construct a consistent estimator f of t we can use © with Z°™ ,n (r) changed 
to ^f™ (Xl) (r). Now, Method 1 can be implemented to obtain an estimator for d° using © 

with Zt™' n (f mi n) replaced by Z^ tyX '\f). Method 2 can also be implemented by replacing 
the superscript <7 mj „ by a m (Xi) in (|10p . 

4 Multiple Change Points 

While our procedure was primarily motivated by applications with single baseline thresholds, 
it can be extended without much difficulty to the case of multiple thresholds. To illustrate 
the idea, consider a regression model where the function /j.(x), with x varying in (0, 1) is at 
its baseline value, say r , on an interval of the form [a, b] with < a < b < 1 and stays above 
the baseline elsewhere. For ease of illustration, we restrict ourselves to the situation with a 
continuous response and homoscedastic errors with unknown variance, as in Section 2.1 (i). 
As in that problem, we would construct p- values at each point, {Z im }f =1 . Our estimates of 
a and b would be obtained as: 

(a m ,„,6 m ,„) = argmin 

a<b 

The computation of the minimizer would proceed by searching over all pairs (Xu) , X(j\) with 
i < j. By extending the techniques used in our proofs one can establish consistency of the 
above least squares estimates without much additional difficulty. In case To is unknown, one 
can use exactly the same estimate of To as advocated in ([§]). Similar reasoning as in the discus- 
sion preceding Q shows that f m) „ would be consistent for To in this setting. It is not difficult 
to see how the procedure would extend to other kinds of regions, for example, a baseline 
zone which consists of a disjoint union of finitely many intervals, though the computational 
complexity increases rapidly with the number of disjoint intervals concerned. The structure 
of the baseline zone will, of course, be determined by the application under consideration. 

The procedure can also be extended to allow for situations where /i attains its minimum 
and maximum values, say T m j„ and T max , on disjoint intervals [a,b] and [c, d], as is the case 
for the gene expression data. Assume, for simplicity, that the minimum and maximum values 
are known (or that very reliable estimates are available). One then applies the procedure 
in the previous paragraph to determine [a,b]. To determine [c,d], note that in the model 
—Yi = —\x(x) — ej, [c, d] is the region on which —fj, hits its minimum and therefore the above 
procedure can once again be applied with the signs of the responses flipped. 

When the minimum and the maximum are unknown, a natural temptation would be to 
estimate the minimum via ([9]) in the original problem, and the maximum via (j9|) again in the 
sign-flipped problem, separately; however, that becomes suspect in this situation since [i has 
two flat stretches while the method advocated in (9) is theoretically justified only when flat 
stretches at levels larger than the minimum are ruled out; see Assumption (A) of Theorem l3.2l 
in this context. However, in many situations, there will be a reasonable degree of separation 
between the minimum and the maximum and it will be possible to identify these values up 
to mutually disjoint intervals (for more on this issue see the analysis of the gene data exam- 
ple). In such cases, the procedure in (9) can again be brought into play by doing restricted 
searches over the r domain: to identify the minimum, minimize the criterion function in (J9j) 
only over the interval in which the minimum is expected to lie (as opposed to searching over 
the entire putative range of n) and do a similar analysis for the maximum (by switching to 



J2 (z im -i/2) 2 + z l 

Xi£[a,b] Xi^[a,b] 
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Figure 4: Plots of the four increasing regression functions /i's (left panel) and the two tent 
shaped fj,'s (right panel). 

the sign-flipped problem). In the absence of other flat stretches (besides the minimum and 
the maximum) this procedure will estimate the extremal values accurately. 

5 Simulation Results and Data Analysis 

We first study the performance of our proposed methods through an extensive simulation 
study. We then compare our approach with other competing procedures mostly developed 
in the dose-response setting. The methodology is subsequently illustrated on the motivating 
examples of the complex queueing system and the gene expression data. In the simulation 
study, we compare in a number of settings, the simplest model of the one-parameter stump to 
both the more complex, sigmoid and the three-parameter adaptive stump models. We under- 
take a comprehensive evaluation of the two methods proposed for estimating the threshold d° 
in the presence of unknown tq , we investigate the performance of the proposed methodology 
when the threshold is located close to the boundary of the design space and finally we discuss 
how one should allocate a fixed budget of samples between number of doses n and replicates 
m . 

5.1 Simulation Studies 

In our numerical studies, six choices of the regression function \i are considered. Four of 
these are non-decreasing and the remaining two are "tent" -shaped. The four monotone ones 
are depicted in the left-panel of Figure HI all of which are to the left of d° = 0.5. Specif- 
ically, Mo, shown by the solid line, is a stump and is identically equal to 0.5 to the right 
of d°; Mi described by the dotted line is a piece- wise linear function (a kink-model) rising 
from to 0.5 between d° and 1; M2, the convex curve, grows like a quadratic beyond d°, 
whilst Ms(x) = exp(— X/(x — 0.5)) l(a; > 0.5) (for an appropriate A so that Mz(l) = 0.5) 
is infinitely differentiable at d°. Thus, from Mo to M3 we have four functions exhibiting 
increasing smoothness at d°. 

Our two "tent" -shaped regression functions are depicted in the right panel of Figure H] 
they are identically till d° = 0.5 and tent-shaped beyond: M4 rises linearly from 0.5 to 

0. 75 with unit slope and then declines symmetrically from 0.75 to 1, reaching at the point 

1, while M5 is a slight variant of M4 rising linearly with unit slope from 0.5 till 0.8 and 
then decreasing with unit slope from 0.8 till 1.0. Estimation of d° = 0.5 for the tent-shaped 
functions is expected to be more challenging than for the monotone ones. Note that for the 
monotone curves, the "signal" keeps on increasing as we move to the right of d°, whereas with 
the tent-shaped curves, the signal starts to weaken beyond a point, and with M4 especially, 
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Table 1: RMSEs for the one-parameter stump (first entry of each column) and the two- 
parameter sigmoid (second entry of each column) with a = 0.1 (top) and 0.3 (bottom panel) 
for five models and different choices of to and n. 



(7 = 0.1 












(to, n) 


Mo 


Mi 


M 2 


M 3 


M 4 


(5, 5) 


0.153, 0.193 


0.154, 0.195 


0.163, 0.217 


0.154, 0.199 


0.154, 0.195 


(5, 10) 


0.130, 0.170 


0.130, 0.169 


0.119, 0.164 


0.099, 0.150 


0.116, 0.174 


(10, 10) 


0.127, 0.169 


0.131, 0.171 


0.110, 0.159 


0.093, 0.142 


0.124, 0.170 


(10, 20) 


0.079, 0.126 


0.068, 0.120 


0.097, 0.126 


0.075, 0.110 


0.060, 0.120 


(10, 50) 


0.030, 0.066 


0.033, 0.069 


0.106, 0.100 


0.082, 0.082 


0.033, 0.073 


(20, 50) 


0.028, 0.062 


0.030, 0.066 


0.088, 0.088 


0.073, 0.077 


0.030, 0.071 


(50, 100) 


0.014, 0.033 


0.014, 0.032 


0.073, 0.058 


0.071, 0.061 


0.014, 0.033 


a = 0.3 












(to, n) 


M 


Mi 


M 2 


M 3 


M 4 


(5, 5) 


0.154, 0.195 


0.155, 0.202 


0.201, 0.235 


0.171, 0.219 


0.208, 0.255 


(5, 10) 


0.130, 0.169 


0.132, 0.175 


0.200, 0.222 


0.140, 0.178 


0.214, 0.258 


(10, 10) 


0.130, 0.164 


0.137, 0.175 


0.166, 0.198 


0.119, 0.163 


0.148, 0.226 


(10, 20) 


0.069, 0.115 


0.089, 0.134 


0.177, 0.181 


0.116, 0.135 


0.120, 0.220 


(10, 50) 


0.032, 0.062 


0.086, 0.093 


0.193, 0.153 


0.128, 0.114 


0.095, 0.212 


(20, 50) 


0.032, 0.062 


0.061, 0.079 


0.162, 0.134 


0.117, 0.092 


0.061, 0.135 


(50, 100) 


0.014, 0.033 


0.039, 0.041 


0.131, 0.093 


0.098, 0.081 


0.039, 0.050 



comes back to the baseline value at the right boundary. 

For each allocation pair (to, n) and regression function fi(x), we generate responses {la, 5^2, 
. . . , Yi m } at covariate value Xi ~ i/(n + 1), for i = 1,2, ... ,n, with Yij — /i(xi) + , the 
{ejjj's being i.i.d. iV(0, cr 2 ). Two different choices of cr (0.1 and 0.3) are considered. For each 
combination of model, allocation and noise-level (cr), the performance, in terms of Root Mean 
Square Error (henceforth RMSE), of the estimator under consideration is evaluated based on 
2000 replicates. Note that a non-random uniform design is used to generate the covariates, 
which is a slightly different procedure from generating the n covariates uniformly from (0, 1). 
Of course, asymptotically, it does not make a difference, but for small n, the regularity of the 
uniform grid has a salutary effect on the performance of the least squares estimates. 

One-parameter stump vs sigmoid model: Table 1 gives the RMSEs for estimating 
d°, for a number of different (m, n) combinations, models Mo through A/ 4 and two values for 
a (0.1 and 0.3), using a one-parameter stump (1/2) l(x < d) and a two-parameter sigmoid, 
as described in ([5]). The first key observation is that in the relatively high "signal-to- noise" 
regime (cr = 0.1), the inference problem is relatively easier and the one-parameter stump 
outperforms the two-parameter sigmoid almost uniformly (with an occasional reversal in high 
to, n settings). Secondly, M 2 and M3 show greater RMSEs in general than Mo and Mi (which 
could be ascribed to the former two models exhibiting greater smoothness at d°), and inter- 
estingly enough greater RMSEs than M4 as well (which is a misspecified model as it returns 
to at the right end of its support). As expected, increasing both to and n leads to improved 
performance, by and large. The use of the sigmoidal approximation in the more modest 
"signal-to-noise" regime (cr = 0.3) for combinations of larger to and n leads to better results 
for the smoother models M 2 and M3, but not for M4, since the shape of the sigmoidal curve 
conflicts badly with that of the corresponding expected p-value curve. Further, the RMSEs 
for M 2 generally tend to be worse than M3 for both cr = 0.1 and a = 0.3, even though A/3 
is smoother in the vicinity of d°. This can be explained by the fact that for the most part to 
the right of d°, M3 provides more signal than M 2 , which makes detection easier in the former 
case. As to and n increase, M 2 improves and its performance comes closer to that of M3 (see 
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Table 2: RMSEs for d° and tq with a = 0.1 and 0.3, respectively for models M3 and M4 and 
different choices of to and n, using the two proposed methods. 



M 3 




a = 


0.1 






M 3 




a = 


0.3 




(m, n) 


d\l) 


d\2) 


ro(l) 


TO (2) 


(to, n) 


Ai) 


d u (2) 


to(1) 


TO (2) 


(5, 5) 


0.091 


0.105 


0.040 


0.032 




(5, 5) 


0.180 


0.161 


0.120 


0.097 


(5, 10) 


0.086 


0.094 


0.027 


0.023 




(5, 10) 


0.191 


0.158 


0.099 


0.067 


(10, 10) 


0.070 


0.081 


0.019 


0.016 




(10, 10) 


0.137 


0.121 


0.067 


0.048 


(10, 20) 


0.078 


0.079 


0.014 


0.011 




(10, 20) 


0.141 


0.122 


0.049 


0.030 


(10, 50) 


0.087 


0.084 


0.009 


0.006 




(10, 50) 


0.147 


0.135 


0.033 


0.019 


(20, 50) 


0.077 


0.074 


0.006 


0.004 




(20, 50) 


0.120 


0.112 


0.021 


0.013 


(50, 100) 


0.072 


0.071 


0.003 


0.002 




(50, 100) 


0.102 


0.099 


0.009 


0.006 



(50,100) setting). 

Comparisons between the one-parameter stump and the three-parameter (adap- 
tive) stump: We compared the 1-parameter stump (with 0.5 and as the levels on either 
side of the unknown split-point) to the three-parameter stump, where both levels as well as 
the split-point are kept unspecified. The comparison is done for three of the six models at 
two different a's, 0.1 and 0.3, for a set of different (to, n) allocations. Although the results are 
not shown due to space considerations, the results to a large extent are not radically different, 
and neither method systematically outperforms the other. For a — 0.1 and relatively small 
to, n, the adaptive stump exhibits slightly better performance for M\ and M5 and is more or 
less comparable to the 1-parameter stump for M%. At a = 0.3, the advantage of the adaptive 
stump lessens. Finally, for large (to, n), both types of stumps behave similarly. Hence, in all 
subsequent simulations one-parameter stumps are employed. 

Assessment of Methods 1 and 2 for estimating d° when To is unknown: We next 
present a comparison of Methods 1 and 2 from Section [3] for estimating d°, when tq is as- 
sumed unknown, for two models (due to space considerations): the monotone M3, and the 
tent-shaped M4. For each model, we present the RMSEs for d° for each of the methods and 
also those for tq in Table 2. 

It is noted that for fixed a and (m, n), the RMSEs for d° increase from M3 to M4. This 
phenomenon is also observed when we compare any one of the monotone models Mi , M2 or 
M3 with either of the tent-shaped ones M4 or M5. Since the stump is a monotone working 
model, this result is expected in light of the first three models and the lack of monotonicity of 
the others. Overall, Methods 1 and 2 are comparable, although for smaller a = 0.1, Method 
1 exhibits smaller RMSEs for d°, while for larger a = 0.3, Method 2 dominates slightly. It is 
worth noting that for model M4, for small to, the performance deteriorates with increasing 
n, for large a. Although counter-intuitive, this phenomenon can be explained by noticing 
that in this situation the expected p- value curve has a "V" -shape to the right of d° and 
conforms badly with the monotone nature of the fitted stump. For small n, this discrepancy 
is somewhat masked by the small number of observations (i.e., p-values), but for large n, this 
non-conformity is clearly exhibited. Regarding To, Method 2 has a slight edge over Method 
1 at both noise levels. Finally, for the challenging setting of M4, the performance of both 
methods is rather inferior for small values of to and a = 0.3. 

Comparisons among the methods for extreme values of dP: We have so far concen- 
trated on the case with d° = 0.5. We investigate next, settings where d° is closer to one of the 
boundaries; specifically, where the "action" starts fairly quickly at a low covariate level and 
also where the "action" starts late. To this end, we consider the models Mi and M2, where 
Mi is flat at zero till 0.2 and then rises linearly with slope 1 all the way up to 1, while Mi is 
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M 4 




a = 


0.1 






M 4 




a = 


0.3 




(m, n) 




d u <2) 


rn(l) 


Tn (2) 


(m, n) 


d u (l) 


d u (2) 


rod) 


T() (2) 


C5 5*1 


0.138 


0.112 


0.067 


0.032 




C5 5~1 


0.243 


0.221 


0.110 


0.096 


(5, 10) 


0.109 


0.113 


0.045 


0.026 




(5, 10) 


0.311 


0.294 


0.091 


0.077 


(10, 10) 


0.074 


0.102 


0.027 


0.018 




(10, 10) 


0.262 


0.232 


0.081 


0.058 


(10, 20) 


0.046 


0.060 


0.018 


0.011 




(10, 20) 


0.286 


0.270 


0.064 


0.047 


(10, 50) 


0.034 


0.033 


0.012 


0.007 




(10, 50) 


0.313 


0.317 


0.053 


0.044 


(20, 50) 


0.025 


0.028 


0.008 


0.005 




(20, 50) 


0.133 


0.137 


0.040 


0.021 


(50, 100) 


0.014 


0.015 


0.003 


0.002 




(50, 100) 


0.050 


0.040 


0.016 


0.006 



Table 3: RMSEs in the kink model with d° = 0.2 (top panels) and 0.8 (bottom panels) 
for different choices of to and n in both tq known and unknown cases, for cr = 0.1 and 0.3 
respectively. 



<f = 0.2 


a = 0.1 




<f = 0.2 


a = 0.3 


(m,n) 


<f 


d"(l),d u (2) 


Tb(l),T (2) 




(m, n) 


<f 




ro(l),T (2) 


(5, 5) 


0.102 


0.333, 0.057 


0.327, 0.048 




(5, 5) 


0.128 


0.429, 0.196 


0.349, 0.138 


(5, 10) 


0.082 


0.291, 0.060 


0.260, 0.038 




(5, 10) 


0.120 


0.436, 0.181 


0.329, 0.111 


(10, 10) 


0.081 


0.259, 0.054 


0.238, 0.027 




(10, 10) 


0.095 


0.396, 0.122 


0.317, 0.080 


(10, 20) 


0.054 


0.143, 0.045 


0.127, 0.019 




(10, 20) 


0.083 


0.367, 0.114 


0.287, 0.058 


(10, 50) 


0.031 


0.044, 0.033 


0.026, 0.011 




(10, 50) 


0.080 


0.308, 0.112 


0.229, 0.036 


(20, 50) 


0.027 


0.025, 0.027 


0.012, 0.008 




(20, 50) 


0.060 


0.210, 0.075 


0.161, 0.024 


(50, 100) 


0.015 


0.015, 0.015 


0.005, 0.003 




(50, 100) 


0.039 


0.058, 0.045 


0.024, 0.010 



flat all the way till 0.8 and then rises linearly with unit slope. These are simple variants of 
the kink- model Mi . We report the performances of the 1 parameter stump when r is known 
(to be 0) with (the homoscedastic error variance) cr = 0.1 and 0.3 for both models, and also 
the performances of Methods 1 and 2 in the t unknown case for different (to, n) allocations 
in Table 3. The design-points are chosen similarly to the case d° = 0.5. 

For t known, the one-parameter stump behaves qualitatively as one would expect. The 
RMSEs tend to be bigger for the d° = 0.8 case (for a fixed allocation and noise level), though 
with increasing (to, n) the RMSEs for both models tend to converge and are similar to the 
numbers for Mi in the t known case, since the models Mi, Mi and M2 look exactly similar 
in a small neighborhood of the kink, and it is this local behavior that drives the asymptotic 
MSE as to, n go to infinity. More interesting is the case when To is unknown. In this case, 
with the model M 2 , Method 1 generally produces slightly smaller RMSEs for d° for both 
cr = 0.1 and 0.3 apart from some of the "rich allocation" scenarios where the performances 
of both methods are very comparable. Method 2, on the other hand, gives better RMSEs 
for t . With Model Mi, we see a markedly different phenomenon. For small (m,n) in the 
case cr = 0.1 and cr = 0.3 (and also for some small to, large n scenarios in the latter case), 
Method 1 shows very poor performance compared to Method 2 with much larger RMSEs for 
d° . A glance at the RMSEs for To reveals what is happening. The estimates of t (needed in 
Method 1 to compute surrogate p-values) are extremely poor, and this leads to biased esti- 
mates for d° . The poor performance is, of course, exacerbated for higher values of a. These 
results indicate that Method 1 can perform pretty badly for small n; in that case, the number 
of covariate values in the flat stretch is small for Mi and this affects the estimation of To badly. 

A related allocation problem: In our simulation study, the proposed procedures for both 
known and unknown To were evaluated for a combination of to and n values. However, in 
practice one is given a total budget of N = n x to samples that need to be allocated to n 
covariate values and to replicates at each covariate value, respectively. Intuitively, increasing 
the number of replicates to decreases the "bias" , whereas increasing the number of values n 
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d u = 0.8 


a = 0.1 




d u = 0.8 


a = 0.3 


(771, n) 


d u 


d u (l),d u (2) 


T (1),T (2) 


(771, 77.) 


if 


d u (l),d u (2) 


Tn(l),r (2) 




0.213 


0.128, 0.174 


0.028, 0.028 






0.178 


0.114, 0.146 


0.080, 0.078 


(5, 10) 


0.119 


0.088, 0.118 


0.023, 0.019 




(5, 10) 


0.115 


0.099, 0.109 


0.060, 0.051 


(10, 10) 


0.126 


0.095, 0.114 


0.016, 0.013 




(10, 10) 


0.120 


0.092, 0.111 


0.044, 0.037 


(10, 20) 


0.060 


0.050, 0.057 


0.011, 0.008 




(10, 20) 


0.087 


0.088, 0.092 


0.033, 0.025 


(10, 50) 


0.033 


0.033, 0.035 


0.008, 0.005 




(10, 50) 


0.084 


0.097, 0.091 


0.024, 0.016 


(20, 50) 


0.026 


0.025, 0.028 


0.005, 0.004 




(20, 50) 


0.060 


0.068, 0.064 


0.017, 0.011 


(50, 100) 


0.016 


0.015, 0.015 


0.002, 0.002 




(50, 100) 


0.038 


0.042, 0.039 


0.008, 0.005 



Table 4: Optimal allocation (to, n) pairs for a fixed total budget N — to x n 



Ml 


Method 1 


,7 = 0.1 (7 = 0.3 


Method 2 


£7 = 0.1 (7 = 0.3 




N = 100 
N = 200 


(6,17) (33,3) 
(7,29) (40,5) 


N = 100 
N = 200 


(4,25) (33, 3) 
(6,33) (15,13) 


M5 


Method 1 


(7 = 0.1 (7 = 0.3 


Method 2 


(7 = 0.1 (7 = 0.3 




JV = 100 
N = 200 


(8,12) (33,3) 
(7,29) (67,3) 


N = 100 
N = 200 


(5,20) (33,3) 
(6,33) (15,13) 



of the covariate, decreases the variance of the estimators. The optimal allocation occurs when 
the two terms are balanced, usually at a moderate value of n and m (which depends on the 
value of cr and the regression function). Thus, for a fixed N, one expects that the RMSEs 
exhibit a "U-shape" as a function of to; further, for larger a the optimal allocation would 
occur at a larger value of m. 

We investigate this allocation problem through a simulation, but due to space considera- 
tions we present the optimal allocations for models Ml and M5 for both Methods 1 and 2. 
The setting under consideration is d° = 0.5, N = 100 and 200 and a = 0.1 and 0.3. All pos- 
sible combinations of to and n that approximately satisfy the total budget were considered. 
The optimal allocations are shown in Table |U 

It can be seen that both methods for small a favor lots of covariate values and few replicates, 
while the situation is reversed for high a. Further, qualitatively similar results, in accordance 
with our observation above, are obtained for the other three models examined (M2, M3 and 
Mi). Nevertheless, a few anomalies are present; specifically, as we are sampling from the 
discrete uniform design on [0,1], and dP = 0.5, sometimes the optimal allocation occurs at 
the rather extreme value n = 3. This is due to the fact that in that case, the covariate values 
are placed at 0.25, 0.5 and 0.75, and when m is large, the fitted break point d n is usually 
(for many replicates) 0.5, the true parameter value. Whenever this is the case, the estimation 
error is exactly zero, making the observed RMSEs small. With the same budget, a larger n 
(say n = 5) can also lead to 0.5 as a covariate value, but the value of m decreases in the 
process (thereby increasing the bias) and there are more options for the fitted break point to 
differ from 0.5, leading to larger RMSEs. 

Some practical recommendations: Based on our extensive simulation study (including 
results not shown here due to space considerations) , the following practical recommendations 
are in order. Overall, it is better for one to invest in an increased number of covariate values 
(n), rather than replicates (to). In the case of a known r, the simple stump model performs 
well overall, while the more complicated adaptive stump model offers only marginal improve- 
ments. In the case where the threshold d° is closer to the boundaries, investment in n proves 
fairly important. For unknown level To, none of the proposed methods dominates the other, 
the result depending on both the noise level and the model under consideration. However, 
for estimation of the level tq, Method 2 exhibits a clear advantage over its competitor. Some 
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Tabic 5: RMSEs for the five procedures for different choices of m and n when a = 0.3 and 
the actual model is M\ (left table) and M2 (right table). 



(m,n) 


Pi 


P 2 


P 3 


Pi 


Ps 


(5,5) 


0.163 


0.207 


0.339 


0.255 


0.299 


(5, 10) 


0.134 


0.176 


0.304 


0.307 


0.344 


(10,10) 


0.119 


0.120 


0.227 


0.228 


0.328 


(10,20) 


0.092 


0.079 


0.191 


0.265 


0.295 


(10,50) 


0.085 


0.042 


0.179 


0.310 


0.247 


(20, 50) 


0.060 


0.030 


0.128 


0.212 


0.176 


(50, 100) 


0.038 


0.013 


0.080 


0.142 


0.114 



Pi 


P 2 


Pa 


Pi 


Ps 


0.204 


0.241 


0.420 


0.291 


0.298 


0.201 


0.227 


0.390 


0.346 


0.360 


0.168 


0.194 


0.334 


0.302 


0.360 


0.177 


0.163 


0.303 


0.329 


0.354 


0.193 


0.150 


0.294 


0.369 


0.332 


0.162 


0.147 


0.245 


0.305 


0.274 


0.132 


0.145 


0.197 


0.254 


0.211 



Table 6: RMSEs for the five procedures for different choices of m and n when a 
the actual model is M3 (left table) and M5 (right table). 



0.3 and 



(m, n) 


Pi 


P2 


Pi 


Pi 


Ps 


(5,5) 


0.173 


0.197 


0.342 


0.245 


0.314 


(5, 10) 


0.140 


0.159 


0.306 


0.297 


0.351 


(10, 10) 


0.126 


0.116 


0.247 


0.228 


0.319 


(10,20) 


0.117 


0.084 


0.216 


0.256 


0.282 


(10,50) 


0.129 


0.068 


0.203 


0.302 


0.248 


(20,50) 


0.110 


0.064 


0.170 


0.213 


0.194 


(50, 100) 


0.098 


0.060 


0.138 


0.164 


0.151 



Pi 


P 2 


Ps 


Pi 


Ps 


0.181 


0.232 


0.376 


0.277 


0.286 


0.153 


0.239 


0.370 


0.370 


0.287 


0.117 


0.203 


0.240 


0.336 


0.282 


0.093 


0.191 


0.198 


0.411 


0.266 


0.084 


0.168 


0.178 


0.465 


0.241 


0.060 


0.148 


0.127 


0.440 


0.175 


0.038 


0.139 


0.080 


0.402 


0.113 



hybrid possibilities are discussed in the concluding remarks section. 
5.2 Comparison with other procedures 

Next, we compare the proposed 1-parameter stump method to some competing procedures 
developed in the pharmacological dose-response setting to identify the MED. Most of the 
methods developed in dose response setting context are based on hypothesis testing pro- 
cedures. For example, Williams (1971) developed a method to identify the lowest dose at 
which there is "activity" in toxicity studies using a closed testing procedure based on isotonic 
regression for a monotone dose-response relationship. Hsu and Berger (1999) developed a 
step- wise confidence set approach to estimate and make inference on the MED. A nonpara- 
metric method based on the Mann- Whitney statistic incorporating the step-down procedure 
is investigated in Chen (1999), while Tamhane and Logan (2002) use multiple testing proce- 
dures for the task at hand. We compare our method with that of Williams (1971), of Hsu 
and Berger (1999) and of Chen (1999), referred henceforth as P3,Pi and P5, respectively. 

We fit the stump £<j with levels 1/2 and on cither side of the threshold d to the observed 
p-values (with To = assumed known) and compare the performance of our approach Pi 
with that of P 3 , P 4 and P 5 . A natural parametric procedure to estimate d° might be to 
fit a kink-type (hockey stick) model like Mi to the observed responses and estimate d (the 
threshold parameter) and the slope of the linear segment by the least squares method. We 
also implement this method and call it P2. Obviously when the true underlying regression 
function /i is not a kink-model this method might not be consistent, but given a finite sample 
it is often a good first approximation. Whereas, when p is a kink-function, e.g., when we 
assume the true model to be Mi, this approach should clearly outperform the other proce- 
dures. Indeed, Table 5 shows that P2 is very competitive for model Mi; still our approach 
Pi performs better for small sample sizes, e.g., (5, 5) and (5, 10). For the model M 2 , a slight 
departure from the model Mi, Pi mostly dominates P2, and all the other procedures. Note 
that as P 3 , P4 and P5 are procedures that are based on testing hypotheses, we need to specify 
a level (a), and in the simulations reported in the paper we have set a = 0.05. The choice 
of the a = 0.05 is purely based on classical hypothesis testing considerations; a proper choice 
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of the tuning parameter is not available. Changing a will change the RMSEs of P3-P5, and 
it is not quite clear how that will affect the estimation procedure and the RMSEs. Also, 
to implement P3-P5, we computed the cut-off values necessary to carry out the hypothesis 
tests using simulation, as such tables are not available for the different choices of to and n 
considered in this paper. 

Table 6 shows the performance of the methods when the true models are M3, the infinitely 
diffcrentiable regression function, and M5 . Clearly Pi dominates all the other methods when 
the data is generated according to M5; notice that P3 and P4 work with the underlying as- 
sumption that [i is nondecreasing, a condition which is violated in M5, and this explains the 
poor performance of these methods. Also, P2 is biased in this scenario, and thus the RMSEs 
do not converge to for large m and n. When M3 is the true model, Pi performs surprisingly 
well; this can be explained by looking at FigurelH the threshold value for M 3 is approximated 
very well by that of the kink-model. This is an artifact of the particular choice of the infinitely 
differentiable function, and the RMSEs can in general be very different if the fj, is not well 
approximated by a kink function. Overall, P\ is very competitive, and the simplicity of our 
approach coupled with its adaptivity to different types of mis-specifications, makes it a very 
attractive choice. We also note that the fitting of £<j does not require any tuning parameter, 
an obvious advantage over the testing based procedures. Indeed, one of the novelties of our 
approach lies in the fact that we treat the estimation of d° purely as an estimation problem 
and not a result of a series of hypotheses tests, thereby avoiding the need to specify a. 

5.3 Data Applications 

In this section, we apply the proposed procedure to the two motivating applications. Note 
that the first one corresponds to a rich allocation scheme in terms of (to, n), while the second 
application to a sparse one, thus showing the range of applicability of the procedure. Further, 
for the first application we discuss a subsampling mechanism that allows us to calculate con- 
fidence intervals for do, given the large number of doses available. This is not repeated for the 
gene expression data due to the paucity of time points, but for richer time course experiments 
it would become applicable. 

Queueing System: We consider a complex system comprising multiple classes of customers 
waiting at infinity capacity queues and a set of processing resources modulated by an external 
stochastic process. The system employs a resource allocation (scheduling) policy that decides 
at every time slot which customer class to serve, given the state of the modulating rate process 
and the backlog of the various queues. In Bambos and Michailidis (2004), a low complexity 
policy was introduced and its maximum throughput properties established. This canonical 
system captures the essential features of data/voice transmissions in a wireless network, in 
multi-product manufacturing systems, and in call centers (for more details see Bambos and 
Michailidis (2004)). As discussed in the introductory section, an important quantity of inter- 
est to the system's operator is the average delay of jobs (over all classes), which constitutes 
a key performance metric of the quality of service offered by the system. The average delay 
of the jobs in a two-class system as a function of its loading under the optimal policy, for a 
small set of loadings is shown in the left panel of Figure [T] These responses were obtained 
through simulation, since for such complex systems analytic calculations are intractable. 

We next employ the developed methodology for estimating both the loading d° and the 
unknown level To- Ten replicates of the response (average delay) were obtained based on 
5,000 events per class by simulating the system under consideration and after accounting for 
a burn-in period of 2,000 per class in order to ensure that it reached its stationary regime. 
The means per loading, YiS, are shown in Figure O 

We applied both procedures discussed in Section [3] with an unknown value for tq , assum- 
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0.05 0.1 0.15 0.2 0.25 0.3 0.35 



Figure 5: Plot of the average responses Yi. 

ing heteroscedastic errors. The first method (Method 1) gives an estimate of To, denoted by 
fo(l) = 2.61 with corresponding er(l) = 0.151, while the second method (Method 2) gives 
fo(2) = 2.52 with corresponding d°(2) — 0.117. It can be seen that there is fairly strong 
agreement between the two estimates. From the system's operator point of view the average 
delay of jobs exhibits a markedly increasing trend beyond a loading of 15%. For the first 
method, a plot of the P-values (left panel) and the criterion function (right panel) that is 
minimized in Theorem 3.2 are given in Figure El 

A second analysis, assuming homoscedastic errors, produces fairly comparable results: dr(l) = 
0.151 and f (l) = 2.59 and d°(2) = 0.128 and f (2) = 2.52, respectively for the two methods. 
Another question of interest to the system's operation is what level of uncertainty, as reflected 




x 



Figure 6: Plots of the estimated p-values using Method 1 (left panel) and the plot of the 
criterion function whose minimizer provides the estimate of To (right panel). 

through confidence intervals, can be assigned to these estimates. Obviously, our results es- 
tablish consistency of the threshold d° and level fo estimates, but no characterization of their 
asymptotic distribution has been provided, which would have resolved this issue. Neverthe- 
less, we outline a subsampling based procedure for partially addressing the construction of 
confidence intervals, provide some theoretical justification in the Appendix and finally discuss 
some open issues (see Section [5]) . 
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The steps in the employed procedure are: 

1. Sample m n vectors out of the 100 {Xi,Yi} pairs without replacement. 

2. For the collected sub-sample, compute estimates of d° via the two methods, denoted 
by d°'*(l) and <f°'*(2) respectively. Two versions of d°'*(l) are calculated: the first, 
denoted by <i '*(l,a) is calculated using the estimate of r based on the full sample 
(in other words, taking fo(l) as the "truth"), while the second denoted by d°'*(l,b) is 
computed after re-estimating To from the obtained subsample. 

3. Calculate the following statistics: t\ n = ml/ 3 (d°'*(l,a)-d°(l)),t^ n = ml/ 3 (d°<*(l,b)- 
dP(l)) and t% n = ml/ 3 (d°>*(2) - dP(2)) . 

4. Repeat the above 3 steps a large number of times (say B), storing the three statistics 
in the preceding step for each iteration, and obtain the empirical distributions for each 
of these statistics based on the B iterates, say {F* n } 3 =1 . 

5. Calculate q* 025 •„ and q* g75 the 2.5-th and 97.5-th percentiles of F* n respectively, 
for j = 1,2, 3." 

6. Prescribe [d°(l) - n~^ 3 q* g75 ;J> , - T» _1/s g*02B,i,n]> for j = 1,2, and 

[c?°(2) — n~ 1 / 3 q* g75 3 n ,d°(2) — n~ 1 / 3 q* 025 3 „] as approximate 95% confidence intervals 
for d°. 

Using m n — 50 and assuming heteroscedastic errors the 95% confidence intervals for d° when 
j = 1,2,3, arc (.133, .179), (.135, .158) and (.099, .125) respectively. Very similar results are 
obtained for m n — 75, while for smaller values of m n (e.g. 10 or 25) the resulting confidence 
intervals become exceedingly wide. Under the assumption of homoscedasticity the obtained 
95% confidence intervals are (0.133, 0.178), (0.151, 0.176) and (0.120, 0.139), respectively. 
It should be noted that the third confidence interval does not overlap with the second one, 
something that can be attributed to their fairly narrow respective lengths. From the system's 
operator perspective, it can be seen that running the system at loadings below 12% of the 
total capacity produces very small delays on the average for its customers, while at loadings 
larger than 18%, customers should expect to experience increasing delays. 

Time course analysis of the Transglutaminase 2 gene: This example deals with a 
time course experiment that studies the effects of cells treated with TGF-/3, a key cytokine 
implicated in a number of disease processes including cancer, on the epithelial-mesenchymal 
transition, a phenotypic conversion that enables cancer cells to attain their migratory and 
invasive capacities (see Keshamouni et al. (2009)). The data correspond to the expression 
levels of the Transglutaminase 2 gene measured in triplicate at 0, 0.5, 1, 2, 4, 8, 16, 24 and 
72 hours. This gene has been implicated in this mechanism through an elevated level at 72 
hours, compared to the baseline at hours (see Keshamouni et al. (2006)). The questions of 
interest are (a) at what time point its expression level rises from its baseline, since this pro- 
vides a timeline for the activation of this transition mechanism in cell-lines; (b) at what time 
point its expression level reaches saturation, which indicates that transition to malignancy is 
essentially complete. 

The average expression level of the 3 replicates is shown in Figure [7] and it can be seen 
that it rises from its initial baseline level and subsequently flattens out. It is therefore rea- 
sonable to postulate a model that is constant (r m i n ) till the first transition point d m m, then 
increases monotonically until it reaches the second transition point d max beyond which it 
remains constant at r max . Without loss of generality, for the analysis, the nine design points 
were taken as equispaced in the [0, 1] interval, since our interest focuses on identifying the 
stages where the changes occur. The procedure described in Section 0] was employed to esti- 
mate the parameters of interest with the p-values being computed under the assumption of 
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Figure 7: Average expression of the Transglutaminase 2 gene over time. 



heteroscedasticity of the error (as evident from the right panel of Figure [1} . For the base- 
line, Method 1 gives an estimate f m i n (l) = 3.974 with corresponding d m i n (l) = 0.4, while 
Method 2 gives f m i n (2) = 3.924 with corresponding d m j n (2) = 0.2. It can be seen that the two 
methods basically agree on the value of r m i n , but the second one identifies the change-point 
somewhat earlier than the first method. For the maximum level, Method 1 gives an estimate 
Tmax(l) = 4.367 with corresponding d TOax (l) = 0.6, while Method 2 gives f max (2) = 4.369 with 
corresponding d max (2) = 0.6, thus exhibiting perfect agreement. To estimate the minimum 
(baseline) value using Method 1 we minimized ([9]) over the restricted interval [min(Yi.),4] 
and for the maximum over the restricted interval (4, max(Yi.)] respectively, as advocated in 
Section [U The choice of the intervals do not matter so long they are disjoint and are contain 
the true minimum and maximum. We believe that the 0.4 time point is a more accurate 
estimate of d m i n , since the proteins encoded by the Transglutaminase 2 gene exhibit increased 
levels at 8 hours (the 0.5 design point) and nothing significant prior to that time, as obtained 
from a separate experimental platform described in Keshamouni et al. (2009). 

6 Concluding Discussion 

In this paper, we address the problem of identifying the threshold parameter at which a re- 
gression function diverges from its (possibly unknown) baseline value employing a p-value 
framework, under a controlled sampling setting. The p-values exhibit a natural dichotomy in 
behavior on different sides of the threshold, which is crucially used in our methodology. Our 
approach is computationally simple, nonparametric in nature and adaptive in the sense that 
it does not need any specification of the local behavior of the regression function around the 
threshold value. The procedure can also be used to prescribe estimates for the baseline value 
of the regression function. We establish asymptotic properties of the proposed estimators, 
extend them to the case of multiple change-points, study their finite sample behavior through 
an extensive simulation study, compare them for the MED problem to competing approaches 
and apply them to two real applications. The numerical results indicate that the procedure 
performs well in both high and relatively low signal-to-noise settings. 

We conclude with a discussion of a number of issues, some of which will be the focus of 
future work. In our setting, we dealt with the case of a balanced design with a fixed number 
of replicates m for every dose level Xj. The case of varying number of replicates m, can be 
handled analogously, although some care needs to be exercised in the technical arguments. 
Under the assumption that the minimum of the mj's goes to infinity, all consistency results in 
this paper continue to hold. A natural extension of the problem would be to investigate the 
performance of our method when we have a random design of points with no replicates, i.e., 
rrii = 1. Such problems arise in diverse contexts, e.g., in the astronomy application discussed 
in the Introduction, where the goal is to estimate the "tidal" radius. Note that in this setup, 
(approximate) p-values can be computed by binning the covariate space and averaging the 
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responses in each bin. 



The problem of constructing confidence intervals for the threshold d° is of considerable inter- 
est in many applications. Ideally, one would like to consider the asymptotic distribution of 
d m ,m as both m,n increase to infinity and use the quantiles of the resulting distribution to 
calibrate confidence intervals. However, as discussed in the previous section, this is a hard 
problem and its solution, presently unknown, is outside the scope of this paper. This consti- 
tutes a subject of future research. 

In the case of an unknown threshold To, our numerical results show a superior performance of 
Method 2 over its competitor, especially in small (to, n) scenarios. The poorer performance 
of Method 1 in these situations was tracked down to the inaccurate estimation of To in the 
simulation section. Note that consistency of the estimator is formally established for Method 
1. It would therefore be interesting to explore a hybrid procedure and its properties, where 
To is estimated using Method 2 and the resulting estimate is used as a plug- in in Method 1. 

Although we develop our method in a simple univariate regression setup, our approach, can 
be generalized to identify the "baseline" region in multi-dimensional covariate spaces. Fur- 
ther, for regression models in higher dimensions, the problem of estimating regions in the 
covariate space where the regression function stays below a pre-specified threshold can also 
be handled by our approach. Such problems, known as level-sets estimation, have been ex- 
tensively studied in the statistics and engineering literature (see for example, Singh et al. 
(2009), Willet and Nowak (2007)). Indeed, Section [2~2l of the current paper may be viewed 
as a level-set estimation problem, albeit in a simpler setting, but under minimal assumptions 
on the behavior of the regression function at the boundary. 

In conclusion we recall that the problem treated in this paper could also have been treated 
by direct estimation of the underlying regression function //. However, as briefly mentioned 
in the Introduction, straightforward intuitive estimates of the form d = ini{x : £l(x) > 0} 
underestimates d°; potential fixes lead to solutions for which very little is known in terms of 
asymptotic properties. Nevertheless, it would be fruitful to explore that line of research. 
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7 Appendix 

We start with establishing an auxiliary result used in subsequent developments. 

Theorem 7.1 Let {M^ : r £ 7~}$£Li be a family of (real-valued) stochastic processes indexed 
by h E H and let {M T : r G T} be a family of deterministic functions defined on H, such that 
each M T is minimized at a unique point h(r) € H. Here % is a metric space and denote the 
metric on % by d. Let h7 n be a minimizer o/M£. Assume further that: 

(a) sup Ter Bup h6W \Ml{h) - M T (h)\ ^Oand 

(b) For every n > 0, c(rj) = inf r inf^^ (fc(r)) {M T (h) — M T (h T )} > 0, where B v (h) denotes 
the open ball of radius rj around h. 

Then, (i) sup T d(h T n ,h T } converges in probability to 0. Furthermore, if T is a metric space 
and h T is continuous in r , then (ii) A h T ° , provided T n converges to tq. In particular, if 
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the 's themselves are deterministic functions, the conclusions of the theorem hold with the 
convergences in probability in (i) and (ii) replaced by usual non- stochastic convergence. 

Proof: We provide the proof in the case that % is a sub-interval of the real line, the case 
that is relevant for our applications. However, there is no essential difference in generalizing 
to the metric space case. Euclidean distances simply need to be replaced by the metric space 
distance and open intervals by open balls. 

Given i] > 0, we need to deal with P* {sup Tg7 - \h r n — h(r)\ > n}. We deal with outer 
probabilities to avoid measurability difficulties. The event A n>n = {sup Tg7 - — h{r)\ > r/} 
implies that for some r, h n ^ (h(r) — rj, h(r) + rj) and therefore 

M-(K) - M T (h(r)) > inf {AT(fc) - M T (h(r))} . 

This is equivalent to 

M T (hl) - M T {h{r)) - MUK) + Ml(h(r)) > 

inf {M-{h) - M T (h(r))} + M£(/i(r)) - M£(/£) . 

Now, M^(/i(r)) — M^(/i^) > and the left side of the above display is dominated by 

2 ||M; - M T \\ n = 2 sup \M T n {h) - M T {h)\ , 
hew. 



implying that: 



2||M; - M T \\ n > inf {M T (K) - M T {h{r))} , 

h<£(h(T)~ri, h(r)+r]) 



which, in turn, implies that: 

2 sup ||M; - MIk > inf inf {M T (h) - M T (/i(r))} = c(n) , 

t£T r<E ' h£(h(T)-ri ,h(T)+ri) 

by definition. Hence 

A n . v C {sup ||M£ — M T \\-u > c(r/)/2} . 

By assumptions (a) and (b), P* {sup rg7 - ||M£ — M T ||« > c(r])/2} goes to and therefore so 
does P*(A njTI ). □ 

Remarks: We will call the sequence of steps involved in deducing the inclusion: 



sup \K - h(r)\ > r, \ c sup \\M T n - M T \\ U > c(r?)/2 , 

as generic steps. Very similar steps will be required time and again at places in the proofs of 
the theorems to follow. We will not elaborate those arguments, but refer back to the generic 
steps in such cases. 

Proof of Theorem I2.lt We prove Part (b) of the theorem since Part (a) follows by an 
(easier) adaptation of the arguments needed for Part (b). Recall that in Part (b), we find 
the best fitting stump to the observed p-values Zi m , i = 1, 2, . . . , n. Letting = al(x < 

d) + pi(x > d) for 9 = (a, /?, d), we minimize 

n 

M m<n (9) -^{^-^(XO} 2 - ]T (Z m -a) 2 + (^™-/?) 2 (11) 

i=l i:Xi<d i:Xi>d 
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over 9 = (a,/3,d) G [0, l] 3 . Letting m , n = (d m ,„, /3 m ,„, d m ,n) = argmin 9e [ ,i]3 M m)n (0), we 
see that 9 m>n is a natural estimator of 0o = (0.5, 0, d°). Let || • || denote the ^-metric in R 3 , 
i.e, ||(a, 6, d)|| = max{|a|, \b\, \d\}. Note that as m changes, the distribution of Zi m changes, 
and so we effectively have a triangular array of i.i.d. random variables {(Xi, Zi m )}f =1 ~ P m . 
It suffices to show that 9 m ,n 0o as rn,n— too, i.e., given e,£ > 0, there exists A G N, such 
that for all m,n> A, P m {||0 m , n — O \\ > e} < £. 

Using empirical process notation, note that M mj „(0) = P n . m {Zi m — £.e(Xi)} 2 and define 
M m (9) = P m {Zi m — £g(Ai)} 2 where M m (9) can be simplified as 

M m (9) = / -j> m (x) - a} 2 px(x)dx + / j>m(x) - (3} 2 p x (x)dx + c m , (12) 
Jo Jci 

withc m = Jjff^ijpxlijiii, where a 2 ra (x) =Var(Z im |Ai = x). Note that cr^ (x) — > (l/12)l(x < 
d°) = <r 2 (x) as m — » oo. Let M(0) be the same expression for M m (9) in (fl"2"|) with v m [x) 
replaced by z/(x) = (l/2)l(x < d°) and c m replaced by c = J Q er 2 (x)px(x)dx, e.g., M(9) = 

f Q {i/(x) — a} 2 px{x)dx + Jj{i/(x) — P} 2 px(x)dx + c. Observe that M(9) > c for all 9, and 
M(9q) — c. Also it is easy to observe that 0o is the unique minimizer of M(9). Note that 
Mm{9) can be expanded as 

M m {9) = I v^ n (x)px{x)dx - 2a / v m (x)p x (x)dx 
Jo Jo 

- 2/3 / v m (x)p x {x)dx + a 2 / p x (x)dx + f3 2 / p x (x)dx + c m . 



Using a similar expansion for M(6), the difference |M m (0) — M(ff)\, can be bounded as 

r 1 



Jo 

i 

-2/3/ |z/ m (x) - z/(x)|px(a;)dx + \c m - c| 
< 8 / |f m (x) - i/(x)|px(a;)da; + \c m - c\ ->• 



uniformly in G [0, l] 3 , e.g., \\M m - M\\ x = sup ee[01]3 \M m (9) - M(9)\ -> 0. By Theorem 
[71] 9 m = (a m ,/3 m ,d m ) = argmin ee [ 0!l] 3 M m (9) ->• argmin ee [ 0> i]3 M{9) = O as m -> oo. 
Notice now that it is enough to show that for any e > 0, for some Mq (possibly depending on 

e), 

sup P m {sup||0 m,7i @m 

|| > e} — > 0, as k — > oo. (13) 

m>M n>k 

To see this, take any £ > 0. Then, by ([T^| and the fact that m — )■ 0o, there exists A G 
N, A" > M such that for all k > A, 

sup P{sup || || >e/2}<£, and||0 fc -0 o || < e/2, 

m>Mo n>fc 

which implies for n, m > K, 

P{\\9 m . n - O || > e} < P{\\9 m . n - 9 m \\ > e/2} < P{ sup ||0 m>n - m || > e/2} < £, 

n>K 

thereby completing the argument. 
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To show that (fTUl) holds, consider the class of functions T = {fg(x,z) = (z — a) 2 l(x < 
d) + (z- /3) 2 l(x > d)\6 = (a,(3,d) g [0,1] 3 } with the envelope F(x,z) = 1. Note that T is 
formed by combining three bounded VC classes of functions: {(z — a) 2 : < a < 1}, {(z — /3) 2 : 
< f3 < 1} and {l(x < d) : < d < 1} through finitely many operations involving addition 
and multiplication and therefore satisfies the entropy condition in the third display on page 
168 of van der Vaart and Wellner (1996). It follows that J- satisfies the conditions of Theorem 
2.8.1 of van der Vaart and Wellner (1996) and is therefore uniformly Glivenko-Cantelli for the 
class of probability measures {P m }, i.e., 

sup P m {sup ||M ro ,„ - M m \\oc > e} -> (14) 

m>l n>k 

for every e > as k —¥ oo, where || • ||oo denotes the supremum (uniform) norm over the 
function class. 

Fix e > and consider the event {||# m! „ — 9 m \\ > e}. Since 9 rn minimizes M m and 9 m ,n 
minimizes M mj „, by arguments analogous to the generic steps in the proof of Theorem 17.11 
we have: 

\\9 m ,n - 9 m \\ > e => ||M m ,„ - MmWoo > T] m (e)/2 , (15) 



where 



and 1 = (1,1,1)'. 



7? m (e)= inf {M m {9) - M m {9 m ) . 

9£[6 m -el,8 m +el}c 



Claim: There exists r/ > and an integer Mq such that rj m (e) > r\ > for all m > Mo. 
We assume the claim for the time being, which yields, 

P m {sup || II > e} < -Pm{sup ||M TO ,„ - MmWoo > ?y/2},Vm > M 

n~>k n>k 

and thus by (IT41 . sup m> ^ /o P{sup„ >fe \\S mi7l — 9 m \\ > e} — > as k — > oo . This completes the 
proof of the theorem. □ 

Proof of the Claim: Let us bound M m (9) — M m (9 m ) below as 

M rn (9)-M m (9 m ) = (M m -M)(9)-(M m -M)(9 rn ) + {M(9)-M(9 m )} 
> -2\\M m - M|U + {M{9) - M(9 m )} 

As ||M m — M | |oo — > as m —> oo, it is enough to show that there exists r/ > such that for all 
sufficiently large m, infg 6 [g m _ el .e m + e i]c{M(9) — M(9 m )} > r\. Note that the main difficulty 
arises because M(-) is not twice-diffcrcntiable at (1/2,0, d°). 

We split M{9)-M{9 m ) into two parts as [M{6) - M (1/2, 0, d°) }+{M (l/2,0,rf°) - M{9 m )}. 
Notice that by the continuity of M(-), the second term goes to 0. To handle the first term 
notice that M(9) - M(l/2,0,d°) = )*{v{x) - a} 2 p x (x)dx + J^{v(x) - /3} 2 p x (x)dx. 
There exists M a G N such that for all m > M , we have 9 m e [6 - (e/2)l,9 + (e/2)l] as 
9 m —> 9 . Observe that for 9 = (a, /3, d) £ [9 m — el, 9 m + el] c , m > M , and d > d°, we have 

M{6) -M(l/2,0,d°) =J Q - aj p x {x)dx + J (3 2 p x (x)dx 

+ J P 2 px{x)dx > J Px(x)dx>(^j kI 
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as 1 1/2 — a | > \a m ~ ot\ 
and d < d° , we have 



1/2| > e/2. Similarly, for 6> £ [0 m - el,0 m + el] c , m > M , 



M(6) -M(l/2,0,d°) 



— a I px(x)dx + 



P p x {x)dx 



+ 



f P 2 p x (x)dx >( I3 2 p x (x)dx > (^-Y kI 



,,n, 



as |/3| > |/3 m — j3\ — |/3 m | > e/2. Take 77 = /de 2 /4. This completes the proof of the claim. 

Proof of Theorem l3.2t We start with some notation. Let Wm = y/ritei^/ao, i = 1,2,.. 
and consider our i.i. d. "data" as: {X,, wi i} }™ =1 . Note that W$ has density <f> m (-). Let 
JPVi.m(') denote the empirical measure of these observables and P m the joint law of (X\, Wm )■ 
Let (To denote the true variance of ey, and let cr denote any such generic value. For a fixed 
a- > and /jgl define (with a slight abuse of notation): 



m(Yi. —tq) — h 



= 1 - $ 



(/J-(Xi) - to) - h + ypmti 



i=l 



Zl m {h) - - 



and note that h m n = argmin^ M.° m (h) = \frn(j., 
argmin^ M m (h) where 



m.n 



M m (h) = P n 



i-* 
2 



t ), where a = o-„, m . Let h m 
m(fi(Xi) - r ) - h + <ioW£ r 



px(x)dx 



With g% h (x, y)=® ^(M(*)-To)-h+a y ^ 



Let e, £ > be given. Letting P m denote the n-fold product measure of P m , we want to 
show that P m {\h m n — 0| > e} < £ for all large m and n. Subsequently, we will denote P m 
by P m (again in an abuse of notation), but it will be clear from the context whether we are 
alluding to the product measure. We bound the quantity of interest as 



Pm{K, n - 0| > e} < P m {\h m . n - h m \ > e/2} + P m {\h m - 0| > e/2}. 
We employ the following steps to complete the proof of the theorem: 



(16) 



Step 1: Establish that there exists So > and Mq > such that \a — <jq| < So and m > Mq 
implies \h m - 0| < e/2. 

Notice that as a is a consistent estimator of cro, there exists Mi such that for all m,n > Mi > 
0, P m {\a m ,n - <tq\ < S Q } > 1 - C/3. Therefore, using Step 1, P m {\h m - 0| > e/2} < £/3 for 
m > max{M ,Mi}. 

S^ep J?: Note that the first term on the right side in (fl6|) is bounded by 

Pm{\h m . n - h m \ > e/2, \a - a \ < 5 } + P m {\a - °o\ > S } 
< P m \ sup \h mtn -h m \ >e/2i+£/3 (17) 

I |er— cr 1<<5 I 
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for all n,m > max{Mo,-Mi}. Therefore, it is enough to show that for some M (possibly 
depending on e), 

sup P m < sup \\f m n - \f rn \ > e/2 > -> 0, as n -> oo. (18) 

m>M (jo— ct |<<5o ' J 

Proof of Step 1: We study the behavior of M^(h) as to — > oo. Note that g!^ h (x,y) —> 
$( ~ ft+ffoj/ ), if x < d°, and 1 if a; > d°, as to — > oo. Therefore, M^(h) converges point- 
wise (by an application of the dominated convergence theorem (DCT) along with Scheffe's 
theorem) to M a (h), where 



M a (h) = cl(h) / p x {x)dx + j I p x {x)dx < 1/4. (19) 

'd° 



1 

4 

where c1(h) = {1/2 — $ ((—h + a y)/a)} 2 (f>(y)dy. To see this, observe that 

{l/2 — g^i' l (x, y)} 2 4> m (y)dy, which is uniformly bounded by a positive constant for all 
to, x, can be decomposed as 

/oo 
{1/2 - g% h (x, y)} 2 {</> m (y) - ^{y)}dy (20) 
-oo 

where the first term converges to c\(h) for x < d° and to 1/4 for x > d°. The second term 
in (f2"0"]) converges to by Scheffe's theorem for all x € [0,1]. The convergence of M^(h) now 
directly follows from the DCT. Let h a = argminM CT (/i) for h E R. 

Claim 1: There exists 6' > 0, such that sup| (T _ (To | <5 , sup ftgR \M% l {h) — M a (h)\ — > as 
to — > oo. 

Proof of Claim 1: For notational convenience we will write $(x) (1 — $(x)) as $ (1 — $)(x) 
in what follows. Choose 6' such that < 5' < <jq. Let rj > be given. Note that 

,d° ,i 

MZ l (h)-M' T (h)=A% h Px (x)dx+ B^ h (x)p x (x)dx (21) 
where = |°° | ^ - $ ( — ± ^ ) } - <P){y)dy and 

To simplify notation, denote the set {(<r, h) : | cr — cr 1 < S' , h E M.} by C. Then, 

sup \M?Jh) - M»| < F x (d°) sup \A% h \ + sup f \B% h (x)\p x (x) dx . 

C C C Jd° 

Now, sup c \A'^ l \ < \4> m — 4>\(y)dy — > by Scheffe's theorem, and 

1 1 2 1 

2 -9™ h {x, y)> 4>(y)dy-~ 



\B% h (x)\ < / \</> m -<l>\{y)dy + sup 
, c 



= o(l) + sup fj (I - S) ~ f ~ k + ^ ) t(y)dy . 

Now, 

sup/ |^' l (x)|p x (x)dx= (sup f \B% h (x)\px(x)dx) vfsup / |B£ h (x)|p x (x) dx J , 
c Jd° yc< J> y yc> Jd° ) 
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where C<o and C>o are denned analogously to C, but with h varying over (—00, 0] and (0, 00), 



respectively. Since, for each x > d°, sup c<0 $ (1 — $ 



/ ^/m(fj,(x)-T )-h+a y 



<t>{v)dy is 



easily seen to be dominated by sup| CF _ crn | <<5 / /^(l — <f>) f :^E(MgL T o)+< J oy ^j ^(y^dy which goes 

to as 771 — y 00 , it follows readily that the first term on the right side of the last display is 
o(l). It remains to deal with the second. To this end, for A, h > 0, define D^ h = {d° < x < 
1 : \fi(x) — (to + h/ \fm)\ < A}. Given r\ > 0, there exists A = \(r\) > (but not depending on 
h > 0) such that f D \,h Pxix) dx < rj by Assumption (A) of Theorem 13.21 Then, 



sup 



d" 



B% h (x)px(x)dx 



< sup 




c 





B% h (x)p x (x)dx 



< 7] + o(l) + SUp 



$(1 - $) 



+ sup 


1 




J[d°,l] 


— hm~ 


1/2) + 


a 



C> J[d°S]-D^ 

The last term in the above display is readily seen to be bounded by 



B% h (x)p x (x)dx 

h 

) 4>{y)dy px{x)dx 



max < $ 



sup 

C>o J[d«,l]-D 

which, in turn, is no larger than 



A 



mA + aoy 



sup max <^ $ — — 1,(1-$) 

[d°,l] J-OO a£[a -S>,cr a +S>] IV a 



77i\ + aoy 



(t>{y)dy px{x)dx 



4>{y)dy px(x)dx 



and this can be made less than 77 for sufficiently large m. It follows that 

sup c>0 J, \B!^ h (x)\px(x) dx < 3 7/ for all sufficiently large m and Claim 1 follows. 

Claim 2: There exists there exists <5o > and Mo > such that for all a with \<r — ao\ < 80 
and m > M , \h a m - 0| < e/2. 

Proof of Claim 2: This will be proved by a direct application of Theorem 17.11 In that 
theorem, take n to be m, T to be the set |cr — cr | < 8' and H to be R. Also, is now 
and M T is now M CT . We will show that M a is uniquely minimized at a point, say h a , and 
also that inf| CT _ (TQ |< 5 / infi/,_j,<ri > ^(M <7 (/i) — M cr (h a )) > for every 7] > 0, whence, by Claim 
1, it will follow that sup\ a _ ao \ <s , \h^ — h a \ converges to with increasing m. But, as will 
also be seen, h? equals for all a and hence Claim 2 follows with 80 taken to be 8' . 

From the form of M CT (/i) (see [19]) it suffices to show that inf| CT _ CT0 |<$/ Lnf|/ l _k<r| >T( (cJ'(7i) — 
Ci(h a )) > 0, where h a is the unique point at which c\ is minimized. We now make some 
change of variables to facilitate the ensuing argument. Define A = ct/cto and s = h/uo- Then 
\a - cro\ < 5' <=> |A - 1| < 8" (for some 8" < 1) and $((-& + a y)/a) = ^A" 1 ^ - s)). 
Defining 

"1 



<f>(\-i(y-s)) 



<Piy) dy , 



it suffices to show that inf| A1 | <d -'/ inf | s _ s a| >J7 / CTo (c^(s) — (s\)) > where s A is the unique 

minimizer of c^. It is easy to see that c\ (s) = E — < &(A^ 1 (2' — s))] ~ where Z is a standard 
normal random variable. By the symmetry of i? about 0, it follows easily that c\ (s) = c^(— s). 
Furthermore (s) is strictly increasing for s > 0, and is therefore strictly decreasing for s < 0, 
showing that is the unique minimizer of c\. Hence s\ = for all A, showing that h a = 
for all <7. Thus, 



inf inf (<£(*)- = inf „(c£fa/tr ) - #(0)) . 

|A-1|<(5" |s-s A |>» 7 /o'o |A-1|<<5 
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Since Ci(r]/ao) — Ci(0) is continuous and positive for each A, its infimum on the set |A — 1| < 5 , 
which must be achieved, is positive. 

Proof of Step 2: Consider the class of functions J-^ = L) m T m where T m = {fh,o-(x, w ) = 
{1/2 - ^(^(^(x) - r )/a + h/a + w^-)} 2 \t £ R, a £ [er - S ,cr + S }}- This is a subclass 
of the large class of functions Q = {g a .p n (x, w) = [1/2 — <f>(a^(x) + (3w + j)] 2 \(a, j3, 7) £ R 3 }. 
Note that the class {afi(x) + f3w + 7} as (a,f3,-f) varies in R 3 forms a finite dimensional 
vector space of measurable functions and is therefore VC. Hence, 1/2 — &(a/j,(x) + f3w + 7), 
being a bounded monotone transformation of a VC class of functions, is bounded VC and 
consequently, so is the class Too- Thus, Too satisfies the entropy condition in the third display 
on Page 168 of van der Vaart and Wellner (1996) and therefore the conditions of Theorem 
2.8.1 of van der Vaart and Wellner (1996) and is uniformly Glivenko-Cantelli for the class of 
probability measures {P m }, i.e., for any given £ > 0, 



sup P m {sup ||M^ fc - AO^x} -> as 

m>l fe>~ 



Tl — > OO 



and therefore, 



sup P m {sup WM^k - M,iy m > C} -> as n -> 00. (22) 

m>l k>n 



Next, using techniques similar to that from proving (|15D , we can show that 

sup |/C,„ - h" m \ > e/2 =* ||M^„ - ilCIK > r? m (e/2) (23) 

\<r— ero|<<5o 

where r? m (e) = inf | CT _ CTn |<«5 w£\ h ^\ >e/2 {M^(h) - M^(/i^)}. 

Claim 3: There exists 77 > and an integer M such that r} m {e) > ^ > for all to > M. 
Proof of Claim 3: By Claim 2, for all sufficiently large to, uniformly for a £ [ctq — So, cro+So], 
we have [h^ — e/2, h a m + e/2] c C [— e/4, e/4] c . We conclude, that for all sufficiently large m, 

Vm(e) > fj m (e) ee inf inf {M£(*0 - M£(*m)} . 

\<t-(t \<S |h-0|>e/4 

For h and a such that |/i — 0| > e/4 and |ct — cto| < 5o, we can bound M° n {h) — M^(/j£j below 
as 

M^{h)-M^ m ) = (M^-M")(h)-M^ m ) + M"(h) 

> - sup Kil^-AOWI-AOOHM^/i) 

|/i-0|>e/4 

> - sup sup \{M^-M a ){h)\ 

|o--<t |<5o |ft-0|>e/4 

sup |(M--M CT )(0)| 

I (7— <To|<6o 

+ inf inf \M a (h) - M CT (0)1 

| CT - CTo |<« |A-0|>e/4 

As 8 up k _ CTO |< 5o sup |ft _ 0|>e/4 |(M£ - M°){h)\ -> and sup |ff _ ffn| <, |(M£ - Af CT )(0)| -> as 
to — » 00, and inf| cr _ (To |< ( 5 inf|/ l _o|> e /4[M cr (/i) — M a (0)} —: 2 77 > 0, it follows that for all large 
to, r) m (e) > rj > 0; therefore, for all sufficiently large to, say m > M, rj m {e) > r/ > 0. This 
completes the proof of the claim. 

Hence, for all to > M, 

sup P m I sup fh^n - h° m \ > e/2 I < sup P m (sup ||M^ ifl - M^ m > T] m {e)/2 

m>M (j<x-<x |<<5o J rn>M lk>n 

< sup P m (sup ||M^ fc - M^\\ Tm > 77/2) 
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and (US) follows from (1221) . 



□ 



Proof of Theorem I2.2t For notational simplicity, we refer to d mjTl and <x mi „ in this proof 
as d and a respectively. We borrow notation from the proof of Theorem 13.21 but note that 
now (j is a constant (as opposed to being a function). We seek to show that for given e > 0, 
P m {\d — d°\ > e} — > 0, as to, n —> oo. Define: 

m a i {Z lm) X 1 ) = {Z? m (ro) - 1/2} 2 1(X 1 < d) + {Z° m (r Q ) - O} 2 ^ > d), 

and let Af£(d) = P m [m°(Z lm , X x )} and M^ >n (d) = P n , ro [m a d {Z Xm , Xi)}. Also, define d™ = 
argmin d M^(rf) and <>™ = axgrmndM^Jd). 

Step 0: Establish that for m sufficiently large, sup^.^^ |d™ — d°| < e, for a pre-assigned 
e > 0, where So > is a small number chosen to depend on e. 



This quantity of interest is dominated by: 

P m {\d-d?\ > e/2} + P m {\d% - d°| > e/2}, 



(24) 



where d = aT ,n . By Step and the consistency of a for a, the second term in the above 
display can be made (arbitrarily) small for all sufficiently large to. By the consistency of a, 
again, it suffices to show that for some M and Sq > chosen appropriately, 



SU Pm>M Prr 



sup K 

|o"-er |<<5o 



> e/2 > ->• 0, as to 



The above display is comparable to (fT5)l in the proof of Theorem l3.2l and follows via arguments 
similar (in fact, much simpler) to those following (|18[) . involving uniformly Glivenko-Cantelli 
classes of functions. We omit the details and in the remainder of the proof, focus on estab- 
lishing Step 0. 



Recall how Wm was defined in the proof of Theorem 



We denote the distribution of 



Wm by <& m and note that $ m converges weakly to Note that 



M*(d) 



(TO 



,-1/2 



1 - $ 



(TTO -1 / 2 



px(x)dx 



rv m 



The point- wise limit of M^(d) as to — > oo is given by 

M°Yd) = I Cl foP x ( x ) dx + C2 id Px(x)dx, for d < d°, 
1 ci J d p x (x)dx + 1/4 f da p x (x)dx, for d > d°, 



(25) 



(26) 



where ci = ^{1/2- <S>(cr y/ a)} 2 (f)(y)dy < - $(<J yAr)} 2 <?Ky)dy = c 2 . We first show 



that 



sup |M£(d)-M CT (d)| 

|<T-o-o|<<5 ,de[0,l] 



(27) 



as to — > oo for some <5q > sufficiently small. 



Choose Sq such that <tq > Sq > 0. To show that the convergence is uniform in d G [0, 1] 
and a e [ao — So, a + Sq], we define g m (r], A) = _E($(A + r\ Wm')), for A e R and ry > 0. For 
each fixed to g m {j], A) is continuous in A, converges to as A —> — oo, to 1 as A — > oo and is 
strictly monotone in A. 
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Also define g(r],\) — E($(\ + rjZ)), where Z is a standard normal random variable. Let 

e > be given. By Assumption (6) we can get ft > such that fl +/3 px(x)dx < e. Note 
that by Assumption (a) of the theorem there exists 77 > such that /j,(x) — tq > rj for all 

x > d° + (3. Let S m = $ (^Wi 1 ^ and S = $ {^Z), where Z ~ N(0, 1). After some 

simplification, using (US]) and (JUJ), we can bound |M£(d) - M a (d)\ for d < d° by 

{^(l/2-5 m ) 2 -S(l/2-5) 2 } / p x (o;)da ; + {^(l-5 m ) 2 -^(l-5) 2 } f p x {x)dx 



E 



1 - $ 



-1/2 



px(a;)(ia;. 



We can analogously bound \M^(d) - M a (d)\ for d > d°. Therefore, to show that ([27]) holds, 
we need to show that 



sup 

|ct-o- |<<5o 



E 



a 



-E 



<t>[^Z 



sup 

|cr-cr |<5 



E 



\ — 1/9 _ Fr m 

Verm 1 ' z cr 



1 



0, and 
0, 



(28) 
(29) 



as m — > 00. To show that (|2"5|) and (|29p hold, we first observe that 

\g m (v, Kn) ~ g{v, A)| < |flw(f7, A m ) - 5(77, A m )| + 15(77, A m ) - g(i], A) I 

where A m — > A as m — > 00, with A = or A = 00. The second term on the right side can be 
controlled by using the continuity of g{rj, A) for a fixed 77, and the first term can be bounded 
by noticing that 

sup |ffm(??,A) -g{r},\)\ 

AeK,J76[l-«o,l+Ko] 



< E 



sup 

AeR,r)G[l-K ,l + Ko] 



&{\ + flW$) -$(A + T7Z) 



as m — > 00 



via an application of the DCT, where Wm^ can be assumed to converge almost surely to Z 
(using Skorohod embedding). This proves (|27]l . 

Note that argmin^g^^] M"(d) = d°. Using techniques similar to that used in proving (p~5]) . 
we can show that 

sup |d™-d°|>e^ sup \M^(d)~ M a (d)\ >??(e)/2 (30) 

|<t-(To|<<5o \a-a o \<S o ,de[0,l] 

where 77(e) = inf \ a - ao \<s inf \ d ™-d<>\>e{M a (d™) - M a {d )} > (follows from (J26J) ) . But, as 
([27| holds, (|30| cannot hold for all large m, thereby completing the proof of Step 0. □ 



Proof of Theorem I3.lt Letting Zf m {r) as in ([7]) (in Section [3]) and m^' T (Zi m , X\) = 
{Zf m (r) - 1/2} 2 1(A! < d) + {Zf ro (T) - 0} 2 1(X! > d) , define 



l^ n (d) = F njm [m^ T {Z lm) X 1 )} and M£ T (d) = P m [m a /{Z Xm , X x )\. 



(31) 



Define d™;™ = argmin de[0jl] M^ n (d) and d£f = argmin de[0)1] M% T (d) and note that d m , n = 
d™~\ Let e > be given. We seek to show that: P m {\d m>n — d°\ >t)->0asm,n-> 00, to 
which end it suffices to show that both P m (\d m , n - d™ To | > e/2) and P m (|d™ To - d°| > e/2) 
go to 0. The second term is easily handled on noting that sup| cr _ CT0 | <(5n |d™ TQ — d°\ < e/2 for 
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some appropriately chosen So (using Step of Theorem 12.21) and the consistency of a m ,n for 
(To- To handle the first term, we show that 

Pm( sup - d£J > c/2 ] ^0, 

\\<r—<ro\<So ) 

and this, again in conjunction with the consistency of <J m ,n implies the convergence of the first 
term to 0. We next introduce some notation. For a real- valued function x a defined on [0, 1], 
define \\x a \\ = sup| o ._ o . Q | <d - sup dg [ ,i] l a;0 '(d)l> where <jq > So > 0. By arguments analogous to 
generic steps, we have: 

Pm l sup | - C,r 1 > e/2 ] < P m (\\M%% - M%*\\ > Vm (e)/2) , 

\|<r— <r |<5o / 



with r/ m (e) = inf| CT _ CTO |< 5o inf| d _ d m T j >e/2 {M£' T °(d) - M% T °(d™ To )}. By using arguments 
similar to that of Claim 3 of Theorem 13.21 and the Claim of Theorem 12.11 (following (TT5)) ), 
we can show that r) m (e) > rj > for all sufficiently large m, whence it suffices to show that 
P m (\\M^ n - M^ T °\\ > rj/2) goes to as m,n -> oo. Now, 

l|MJ& - MZr\\ < \\M%% - M«$\\ + ||M™ - M%*\\ . (32) 

Using arguments (involving universal Glivenko-Cantelli classes) similar to the proof of Step 
2 in TheoremGH we can easily show that: sup m > 1 P m (sup fe >„ \\M^ - M% T °\\ > »?/4) -> 
as n — > oo, whence it readily follows that P m (||M$£ T ° — M^ T °\\ > r]/4) — > as m, n -> oo. 
That P m (||M£>f„ - M^°\\ > 7]/ 4) goes to follows from the fact that ||M£f n - M^|| < 
K ^/m(f m ,n — tq) which goes to by assumption, as m, n oo. This last inequality follows 
from the fact that for any d £ [0, 1]: 



in . a 



(d)-M^(d)\ < - J2 3|^ m (f) - Z? m to)\ + i J2 2|^ m (f) - Zf m (r )| 
n ■* — ' n * — ' 

i:Xi<d i:Xi>d 

< ~ E i^(r) - z?m\ < ro) \ t m)/°- 

i—1 i=l 

< g ^^U o (33) 

(T 

for some universal constant X, whence -fT can be taken to be K/(ao ~ Sq). □. 



Justification of the subsampling procedure: We consider an asymptotic paradigm, 
where m is viewed as fixed, and n as increasing to infinity in the setting of a known to 
(which can then be taken to be without loss of generality). Let the setting be that of 
Theorem 12.11 and consider fitting a stump £d(%) — (1/2) l(x < d) as a working model for 
v m . Recall that the underlying setting corresponds to a regression one with i.i.d. obser- 
vations {Xi, Zi, n }f =1 . Letting P m denote the distribution of (Xi, Zi iTn ), the best-fitting 
stump in the population is characterized by the parameter d m := argmin M m {d), where 
M m {d) = P m [(Zi, m - (1/2)) 2 l(Xi <d) + Z\ m \{X X > d)}. Setting the derivative of M m with 
respect to d, to zero, yields the normal equation v m (d m ) = 1/4. Under reasonably modest 
assumptions on the underlying model, d m is unique and provides an upper bound on d° and 
the larger the m, the tighter the bound. A level 1 — a confidence interval for d m can then 
be used as a surrogate for a level 1 — a confidence interval for d°. Letting d m ^ n be the least 
squares estimate of d° obtained by minimizing Y^i=i {{Zi, m — 0.5) 2 < d) + Zf m l(Xj > d)\ 
over all d, it can be shown, by adapting the techniques of Banerjee and McKeague (2007), 
that for a fixed m, n 1//3 (d m ^ n — d m ) converges to a continuous symmetric but non-Gaussian 
distribution, namely Chernoff's distribution (studied in Groeneboom and Wellner (2001)), 
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as n — > oo. As this distribution depends upon hard to estimate nuisance parameters in the 
model, subsampling without replacement (as in Politis, Romano and Wolf (1999)) or the m 
out of n bootstrap can be used to extract asymptotic confidence intervals for d m . Owing to 
the non-standard nature of the asymptotics involved, the standard Efron-type bootstrap fails 
in this situation (see Sen, Banerjee and Woodroofe (2010) and references therein). 

Note that the implemented subsampling procedure, while relying on the above results in 
spirit, does take some liberties in its implementation. Firstly, the Method 1 based confidence 
intervals require estimation of r since it is unknown for our application. Secondly, the theo- 
retical results above are not immediately applicable to the confidence intervals using Method 
2 which involves a non-trivial modification of the p-values used in Method 1. Nevertheless, it 
seems reasonable to conjecture the same n 1 / 3 rate of convergence to d m for the least squares 
estimates obtained by this approach, with a non-degenerate continuous distribution, which 
would then validate the use of subsampling. 
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